25 SYSTEM FOR ACCELERATED COMMUNICATION," filed August 27, 1999, which in turn

26 claims the benefit under 35 U.S.C. § 119(e)(1) of the Provisional Application filed under 35
28 FOR ACCELERATED COMMUNICATION," Serial No. 60/098,296, filed August 27, 1998.
1 27, 1998. The subject matter of all four of the above-identified patent applications (including

2 the subject matter in the Microfiche Appendix of U.S. Application Serial No. 09/464,283), and 

3 of the two above-identified provisional applications, is incorporated by reference herein.
4 

5 REFERENCE TO COMPACT DISC APPENDIX 

6 The Compact Disc Appendix (CD Appendix), which is a part of the present disclosure, 

7 includes three folders, designated CD Appendix A, CD Appendix B, and CD Appendix C on 

8 the compact disc. CD Appendix A contains a hardware description language (Verilog code)

9 description of an embodiment of a receive sequencer. CD Appendix B contains microcode

10 executed by a processor that operates in conjunction with the receive sequencer of CD

11 Appendix A. CD Appendix C contains a device driver executable on the host as well as ATCP

12 code executable on the host. A portion of the disclosure of this patent document contains

13 material (other than any portion of the "free BSD" stack included in CD Appendix C) which is

14 subject to copyright protection. The copyright owner of that material has no objection to the

15 facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears

16 in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright 

17 rights. 
18 

19 TECHNICAL FIELD 

20 The present invention relates generally to computer or other networks, and more 

21 particularly to processing of information communicated between hosts such as computers

22 connected to a network. 
23 

24 BACKGROUND 

25 The advantages of network computing are increasingly evident. The convenience and 

26 efficiency of providing information, communication or computational power to individuals at 

27 their personal computer or other end user devices has led to rapid growth of such network 

28 computing, including internet as well as intranet devices and applications.

29 As is well known, most network computer communication is accomplished with the aid of 

30 a layered software architecture for moving information between host computers connected to 

31 the network. The layers help to segregate information into manageable segments, the general 

32 functions of each layer often based on an international standard called Open Systems 
1 Interconnection (OSI). OSI sets forth seven processing layers through which information may

2 pass when received by a host in order to be presentable to an end user. Similarly, transmission 

3 of information from a host to the network may pass through those seven processing layers in 

4 reverse order. Each step of processing and service by a layer may include copying the 

5 processed information. Another reference model that is widely implemented, called TCP/IP 

6 (TCP stands for transmission control protocol, while IP denotes internet protocol) essentially

7 employs five of the seven layers of OSI. 

8 Networks may include, for instance, a high-speed bus such as an Ethernet connection or an

9 internet connection between disparate local area networks (LANs), each of which includes
10 multiple hosts, or any of a variety of other known means for data transfer between hosts. 

11 According to the OSI standard, physical layers are connected to the network at respective

12 hosts, the physical layers providing transmission and receipt of raw data bits via the network.

13 A data link layer is serviced by the physical layer of each host, the data link layers providing

14 frame division and error correction to the data received from the physical layers, as well as

15 processing acknowledgment frames sent by the receiving host. A network layer of each host is

16 serviced by respective data link layers, the network layers primarily controlling size and

17 coordination of subnets of packets of data.

18 A transport layer is serviced by each network layer and a session layer is serviced by each

19 transport layer within each host. Transport layers accept data from their respective session

20 layers and split the data into smaller units for transmission to the other host's transport layer, 

21 which concatenates the data for presentation to respective presentation layers. Session layers 

22 allow for enhanced communication control between the hosts. Presentation layers are serviced 

23 by their respective session layers, the presentation layers translating between data semantics 

24 and syntax which may be peculiar to each host and standardized structures of data 

25 representation. Compression and/or encryption of data may also be accomplished at the 

26 presentation level. Application layers are serviced by respective presentation layers, the 

27 application layers translating between programs particular to individual hosts and standardized 

28 programs for presentation to either an application or an end user. The TCP/IP standard

29 includes the lower four layers and application layers, but integrates the functions of session

30 layers and presentation layers into adjacent layers. Generally speaking, application,

31 presentation and session layers are defined as upper layers, while transport, network and data

32 link layers are defined as lower layers. 
1 The rules and conventions for each layer are called the protocol of that layer, and since the

2 protocols and general functions of each layer are roughly equivalent in various hosts, it is 

3 useful to think of communication occurring directly between identical layers of different hosts, 

4 even though these peer layers do not directly communicate without information transferring 

5 sequentially through each layer below. Each lower layer performs a service for the layer 

6 immediately above it to help with processing the communicated information. Each layer saves 

7 the information for processing and service to the next layer. Due to the multiplicity of 

8 hardware and software architectures, devices and programs commonly employed, each layer is 

9 necessary to ensure that the data can make it to the intended destination in the appropriate
10 form, regardless of variations in hardware and software that may intervene.

11 In preparing data for transmission from a first to a second host, some control data is added
12 at each layer of the first host regarding the protocol of that layer, the control data being

13 indistinguishable from the original (payload) data for all lower layers of that host. Thus an

14 application layer attaches an application header to the payload data and sends the combined

15 data to the presentation layer of the sending host, which receives the combined data, operates

16 on it and adds a presentation header to the data, resulting in another combined data packet.

17 The data resulting from combination of payload data, application header and presentation
18 header is then passed to the session layer, which performs required operations including

19 attaching a session header to the data and presenting the resulting combination of data to the

20 transport layer. This process continues as the information moves to lower layers, with a 

21 transport header, network header and data link header and trailer attached to the data at each of 

22 those layers, with each step typically including data moving and copying, before sending the 

23 data as bit packets over the network to the second host.

24 The receiving host generally performs the converse of the above-described process, 

25 beginning with receiving the bits from the network, as headers are removed and data processed 

26 in order from the lowest (physical) layer to the highest (application) layer before transmission 

27 to a destination of the receiving host. Each layer of the receiving host recognizes and 

28 manipulates only the headers associated with that layer, since to that layer the higher layer 

29 control data is included with and indistinguishable from the payload data. Multiple interrupts, 

30 valuable central processing unit (CPU) processing time and repeated data copies may also be 

31 necessary for the receiving host to place the data in an appropriate form at its intended

32 destination. 
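
The header handling described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the layer names, the bracketed header tags and the functions `encapsulate` and `decapsulate` are hypothetical, and real headers carry structured fields rather than fixed tags. The sketch only shows that the sending host adds one header per layer from the top down, and that the receiving host removes them in the converse order.

```python
# Illustrative only: layer names and header tags are assumptions.
LAYERS = ["application", "presentation", "session",
          "transport", "network", "data_link"]
TRAILER = b"[trailer]"


def header_for(layer: str) -> bytes:
    """Return a stand-in header for the given layer."""
    return f"[{layer}]".encode()


def encapsulate(payload: bytes) -> bytes:
    """Sending host: each layer prepends its header, top layer first."""
    data = payload
    for layer in LAYERS:
        data = header_for(layer) + data
        if layer == "data_link":
            data += TRAILER  # the data link layer also appends a trailer
    return data


def decapsulate(frame: bytes) -> bytes:
    """Receiving host: strip headers from the lowest layer upward."""
    data = frame
    for layer in reversed(LAYERS):
        if layer == "data_link":
            if not data.endswith(TRAILER):
                raise ValueError("missing data link trailer")
            data = data[:-len(TRAILER)]
        header = header_for(layer)
        if not data.startswith(header):
            raise ValueError(f"missing {layer} header")
        data = data[len(header):]
    return data
```

Each pass through the loop stands for one layer touching the whole message, which is the repeated handling and copying that the background identifies as a cost of conventional processing.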
1 The above description of layered protocol processing is simplified, as college-level

2 textbooks devoted primarily to this subject are available, such as Computer Networks, Third

3 Edition (1996) by Andrew S. Tanenbaum, which is incorporated herein by reference. As

4 defined in that book, a computer network is an interconnected collection of autonomous 

5 computers, such as internet and intranet devices, including local area networks (LANs), wide

6 area networks (WANs), asynchronous transfer mode (ATM), ring or token ring, wired,

7 wireless, satellite or other means for providing communication capability between separate

8 processors. A computer is defined herein to include a device having both logic and memory

9 functions for processing data, while computers or hosts connected to a network are said to be
10 heterogeneous if they function according to different operating devices or communicate via

11 different architectures.

12 As networks grow increasingly popular and the information communicated thereby

13 becomes increasingly complex and copious, the need for such protocol processing has

14 increased. It is estimated that a large fraction of the processing power of a host CPU may be

15 devoted to controlling protocol processes, diminishing the ability of that CPU to perform other

16 tasks. Network interface cards have been developed to help with the lowest layers, such as the

17 physical and data link layers. It is also possible to increase protocol processing speed by

18 simply adding more processing power or CPUs according to conventional arrangements. This

19 solution, however, is both awkward and expensive. But the complexities presented by various

20 networks, protocols, architectures, operating devices and applications generally require 

21 extensive processing to afford communication capability between various network hosts. 

22 

23 SUMMARY OF THE INVENTION 

24 The current invention provides a device for processing network communication that greatly 

25 increases the speed of that processing and the efficiency of transferring data being 

26 communicated. The invention has been achieved by questioning the long-standing practice of 

27 performing multilayered protocol processing on a general-purpose processor. The protocol 

28 processing method and architecture that results effectively collapses the layers of a connection-

29 based, layered architecture such as TCP/IP into a single wider layer which is able to send

30 network data more directly to and from a desired location or buffer on a host. This accelerated 

31 processing is provided to a host for both transmitting and receiving data, and so improves 
1 performance whether one or both hosts involved in an exchange of information have such a

2 feature. 

3 The accelerated processing includes employing representative control instructions for a 

4 given message that allow data from the message to be processed via a fast-path which accesses 

5 message data directly at its source or delivers it directly to its intended destination. This fast- 

6 path bypasses conventional protocol processing of headers that accompany the data. The fast- 

7 path employs a specialized microprocessor designed for processing network communication, 

8 avoiding the delays and pitfalls of conventional software layer processing, such as repeated

9 copying and interrupts to the CPU. In effect, the fast-path replaces the states that are

10 traditionally found in several layers of a conventional network stack with a single state

11 machine encompassing all those layers, in contrast to conventional rules that require rigorous

12 differentiation and separation of protocol layers. The host retains a sequential protocol 

13 processing stack which can be employed for setting up a fast-path connection or processing 

14 message exceptions. The specialized microprocessor and the host intelligently choose whether 

15 a given message or portion of a message is processed by the microprocessor or the host stack. 

16 One embodiment is a method of generating a fast-path response to a packet received onto a 

17 network interface device where the packet is received over a TCP/IP network connection and 

18 where the TCP/IP network connection is identified at least in part by a TCP source port, a TCP 

19 destination port, an IP source address, and an IP destination address. The method comprises: 

20 1) Examining the packet and determining from the packet the TCP source port, the TCP 

21 destination port, the IP source address, and the IP destination address; 2) Accessing an 

22 appropriate template header stored on the network interface device. The template header has 

23 TCP fields and IP fields; 3) Employing a finite state machine that implements both TCP 

24 protocol processing and IP protocol processing to fill in the TCP fields and IP fields of the 

25 template header; and 4) Transmitting the fast-path response from the network interface device. 

26 The fast-path response includes the filled in template header and a payload. The finite state 

27 machine does not entail a TCP protocol processing layer and a discrete IP protocol processing 

28 layer where the TCP and IP layers are executed one after another in sequence. Rather, the

29 finite state machine covers both TCP and IP protocol processing layers.
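
The four steps of this embodiment can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the disclosed hardware or microcode: it assumes IPv4 and TCP headers without options, represents the template header as a plain dictionary, and uses hypothetical names (`four_tuple`, `fast_path_response`, `snd_nxt`, `rcv_nxt`). Checksums and the remaining header fields are omitted.

```python
import socket
import struct

IP_HEADER_LEN = 20   # assumption: IPv4 header without options
TCP_HEADER_LEN = 20  # assumption: TCP header without options


def four_tuple(packet: bytes):
    """Step 1: extract (IP source, IP destination, TCP source port, TCP destination port)."""
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    ip_src = socket.inet_ntoa(packet[12:16])
    ip_dst = socket.inet_ntoa(packet[16:20])
    tcp_sport, tcp_dport = struct.unpack("!HH", packet[ihl:ihl + 4])
    return ip_src, ip_dst, tcp_sport, tcp_dport


def fast_path_response(packet: bytes, templates: dict, state: dict, payload: bytes):
    """Steps 2 to 4: find the template, fill TCP and IP fields together, return the response."""
    template = templates.get(four_tuple(packet))
    if template is None:
        return None  # no template is held for this connection
    header = dict(template)  # copy so the stored template stays unchanged
    # A single routine fills both protocols' fields; there is no separate
    # TCP pass followed by an IP pass.
    header["tcp_seq"] = state["snd_nxt"]
    header["tcp_ack"] = state["rcv_nxt"]
    header["ip_total_len"] = IP_HEADER_LEN + TCP_HEADER_LEN + len(payload)
    state["snd_nxt"] = (state["snd_nxt"] + len(payload)) & 0xFFFFFFFF
    return header, payload
```

The point of the sketch is the shape of `fast_path_response`: one function updates TCP and IP fields from shared connection state, in place of two layers that each process the packet in turn.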

30 In one embodiment, buffer descriptors that point to packets to be transmitted are pushed 

31 onto a plurality of transmit queues. A transmit sequencer pops the transmit queues and obtains 

32 the buffer descriptors. The buffer descriptors are then used to retrieve the packets from buffers 
1 where the packets are stored. The retrieved packets are then transmitted from the network

2 interface device. In one embodiment, there are two transmit queues, one having a higher 

3 transmission priority than the other. Packets identified by buffer descriptors on the higher 

4 priority transmit queue are transmitted from the network interface device before packets 

5 identified by the lower priority transmit queue. 
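
A minimal sketch of the two-queue arrangement follows. The class `TransmitQueues`, the function `transmit_pending` and the use of integer buffer descriptors that index a dictionary of buffers are assumptions made for illustration; the disclosed transmit sequencer is hardware.

```python
from collections import deque


class TransmitQueues:
    """Two transmit queues of buffer descriptors, one with higher priority."""

    def __init__(self):
        self.high = deque()
        self.low = deque()

    def push(self, descriptor, high_priority=False):
        """Push a buffer descriptor onto one of the two queues."""
        (self.high if high_priority else self.low).append(descriptor)

    def pop(self):
        """Pop the next descriptor, always draining the high-priority queue first."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


def transmit_pending(queues, buffers):
    """Stand-in for the transmit sequencer: pop descriptors and fetch the packets."""
    sent = []
    descriptor = queues.pop()
    while descriptor is not None:
        sent.append(buffers[descriptor])  # the descriptor points at the stored packet
        descriptor = queues.pop()
    return sent
```

For example, if descriptors 0 and 2 are pushed at low priority and descriptor 1 at high priority, `transmit_pending` returns the packet for descriptor 1 first, followed by those for 0 and 2 in arrival order.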

6 Other structures and methods are disclosed in the detailed description below. This 

7 summary does not purport to define the invention. The invention is defined by the claims. 
8 

9 BRIEF DESCRIPTION OF THE DRAWINGS 

10 FIG. 1 is a plan view diagram of a device of the present invention, including a host 

11 computer having a communication-processing device for accelerating network

12 communication.

13 FIG. 2 is a diagram of information flow for the host of FIG. 1 in processing network 

14 communication, including a fast-path, a slow-path and a transfer of connection context 

15 between the fast and slow-paths.

16 FIG. 3 is a flow chart of message receiving according to the present invention. 

17 FIG. 4A is a diagram of information flow for the host of FIG. 1 receiving a message packet

18 processed by the slow-path.

19 FIG. 4B is a diagram of information flow for the host of FIG. 1 receiving an initial message

20 packet processed by the fast-path.

21 FIG. 4C is a diagram of information flow for the host of FIG. 4B receiving a subsequent

22 message packet processed by the fast-path.

23 FIG. 4D is a diagram of information flow for the host of FIG. 4C receiving a message

24 packet having an error that causes processing to revert to the slow-path.

25 FIG. 5 is a diagram of information flow for the host of FIG. 1 transmitting a message by

26 either the fast or slow-paths.

27 FIG. 6 is a diagram of information flow for a first embodiment of an intelligent network

28 interface card (INIC) associated with a client having a TCP/IP processing stack.

29 FIG. 7 is a diagram of hardware logic for the INIC embodiment shown in FIG. 6, including

30 a packet control sequencer and a fly-by sequencer.

31 FIG. 8 is a diagram of the fly-by sequencer of FIG. 7 for analyzing header bytes as they are

32 received by the INIC.
1 FIG. 9 is a diagram of information flow for a second embodiment of an INIC associated

2 with a server having a TCP/IP processing stack.

3 FIG. 10 is a diagram of a command driver installed in the host of FIG. 9 for creating and

4 controlling a communication control block for the fast-path.

5 FIG. 11 is a diagram of the TCP/IP stack and command driver of FIG. 10 configured for

6 NetBios communications. 

7 FIG. 12 is a diagram of a communication exchange between the client of FIG. 6 and the 

8 server of FIG. 9. 

9 FIG. 13 is a diagram of hardware functions included in the INIC of FIG. 9. 

10 FIG. 14 is a diagram of a trio of pipelined microprocessors included in the INIC of FIG. 13, 

11 including three phases with a processor in each phase.

12 FIG. 15A is a diagram of a first phase of the pipelined microprocessor of FIG. 14.

13 FIG. 15B is a diagram of a second phase of the pipelined microprocessor of FIG. 14.

14 FIG. 15C is a diagram of a third phase of the pipelined microprocessor of FIG. 14.

15 FIG. 16 is a diagram of a plurality of queue storage units that interact with the

16 microprocessor of FIG. 14 and include SRAM and DRAM.

17 FIG. 17 is a diagram of a set of status registers for the queue storage units of FIG. 16.

18 FIG. 18 is a diagram of a queue manager, which interacts with the queue storage units and

19 status registers of FIG. 16 and FIG. 17. 

20 FIGs. 19A-D are diagrams of various stages of a least-recently-used register that is 

21 employed for allocating cache memory.

22 FIG. 20 is a diagram of the devices used to operate the least-recently-used register of FIGs. 

23 19A-D. 

24 FIG. 21 is another diagram of Intelligent Network Interface Card (INIC) 200 of FIG. 13.

25 FIG. 22 is a diagram of the receive sequencer of FIG. 21.

26 FIG. 23 is a diagram illustrating a "fast-path" transfer of data of a multi-packet message

27 from INIC 200 to a destination 2311 in host 20.
28 

29 DETAILED DESCRIPTION 

30 FIG. 1 shows a host 20 of the present invention connected by a network 25 to a remote host

31 22. The increase in processing speed achieved by the present invention can be provided with 

32 an intelligent network interface card (INIC) that is easily and affordably added to an existing 
1 host, or with a communication processing device (CPD) that is integrated into a host, in either

2 case freeing the host CPU from most protocol processing and allowing improvements in other 

3 tasks performed by that CPU. The host 20 in a first embodiment contains a CPU 28 and a 

4 CPD 30 connected by a PCI bus 33. The CPD 30 includes a microprocessor designed for 

5 processing communication data and memory buffers controlled by a direct memory access

6 (DMA) unit. Also connected to the PCI bus 33 is a storage device 35, such as a semiconductor 

7 memory or disk drive, along with any related controls. 

8 Referring additionally to FIG. 2, the host CPU 28 controls a protocol processing stack 44 

9 housed in storage 35, the stack including a data link layer 36, network layer 38, transport layer 
10 40, upper layer 46 and an upper layer interface 42. The upper layer 46 may represent a

11 session, presentation and/or application layer, depending upon the particular protocol being

12 employed and message communicated. The upper layer interface 42, along with the CPU 28

13 and any related controls can send or retrieve a file to or from the upper layer 46 or storage 35,

14 as shown by arrow 48. A connection context 50 has been created, as will be explained below,

15 the context summarizing various features of the connection, such as protocol type and source

16 and destination addresses for each protocol layer. The context may be passed between an

17 interface for the session layer 42 and the CPD 30, as shown by arrows 52 and 54, and stored as

18 a communication control block (CCB) at either CPD 30 or storage 35.

19 When the CPD 30 holds a CCB defining a particular connection, data received by the CPD

20 from the network and pertaining to the connection is referenced to that CCB and can then be 

21 sent directly to storage 35 according to a fast-path 58, bypassing sequential protocol 

22 processing by the data link 36, network 38 and transport 40 layers. Transmitting a message, 

23 such as sending a file from storage 35 to remote host 22, can also occur via the fast-path 58, in

24 which case the context for the file data is added by the CPD 30 referencing a CCB, rather than 

25 by sequentially adding headers during processing by the transport 40, network 38 and data link 

26 36 layers. The DMA controllers of the CPD 30 perform these transfers between CPD and 

27 storage 35. 

28 The CPD 30 collapses multiple protocol stacks each having possible separate states into a 

29 single state machine for fast-path processing. As a result, exception conditions may occur that 

30 are not provided for in the single state machine, primarily because such conditions occur 

31 infrequently and to deal with them on the CPD would provide little or no performance benefit

32 to the host. Such exceptions can be CPD 30 or CPU 28 initiated. An advantage of the 
1 invention includes the manner in which unexpected situations that occur on a fast-path CCB

2 are handled. The CPD 30 deals with these rare situations by passing back or flushing to the 

3 host protocol stack 44 the CCB and any associated message frames involved, via a control 

4 negotiation. The exception condition is then processed in a conventional manner by the host 

5 protocol stack 44. At some later time, usually directly after the handling of the exception 

6 condition has completed and fast-path processing can resume, the host stack 44 hands the CCB 

7 back to the CPD. 

8 This fallback capability enables the performance-impacting functions of the host protocols 

9 to be handled by the CPD network microprocessor, while the exceptions are dealt with by the 

10 host stacks, the exceptions being so rare as to negligibly affect overall performance. The

11 custom designed network microprocessor can have independent processors for transmitting

12 and receiving network information, and further processors for assisting and queuing. A

13 preferred microprocessor embodiment includes a pipelined trio of receive, transmit and utility

14 processors. DMA controllers are integrated into the implementation and work in close concert 

15 with the network microprocessor to quickly move data between buffers adjacent to the 

16 controllers and other locations such as long term storage. Providing buffers logically adjacent 

17 to the DMA controllers avoids unnecessary loads on the PCI bus. 

18 FIG. 3 diagrams the general flow of messages received according to the current invention. 

19 A large TCP/IP message such as a file transfer may be received by the host from the network 

20 in a number of separate, approximately 64 KB transfers, each of which may be split into many, 

21 approximately 1.5 KB frames or packets for transmission over a network. Novell NetWare

22 protocol suites running Sequenced Packet Exchange Protocol (SPX) or NetWare Core Protocol

23 (NCP) over Internetwork Packet Exchange (IPX) work in a similar fashion. Another form of

24 data communication which can be handled by the fast-path is Transaction TCP (hereinafter

25 T/TCP or TTCP), a version of TCP which initiates a connection with an initial transaction

26 request after which a reply containing data may be sent according to the connection, rather

27 than initiating a connection via a several-message initialization dialogue and then transferring

28 data with later messages. In any of the transfers typified by these protocols, each packet 

29 conventionally includes a portion of the data being transferred, as well as headers for each of 

30 the protocol layers and markers for positioning the packet relative to the rest of the packets of 

31 this message. 
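
As a rough worked example of the sizes mentioned above, the following sketch counts the transfers and frames needed for a message. The exact figures are assumptions: it takes 64 KB as 65,536 bytes and 1.5 KB as 1,536 bytes of payload per frame, ignores header overhead, and uses the hypothetical function name `frames_per_message`.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def frames_per_message(message_bytes: int,
                       transfer_bytes: int = 64 * 1024,
                       frame_payload_bytes: int = 1536):
    """Return (number of transfers, number of frames) for one message."""
    transfers = ceil_div(message_bytes, transfer_bytes)
    frames = 0
    remaining = message_bytes
    while remaining > 0:
        chunk = min(remaining, transfer_bytes)  # one transfer of up to 64 KB
        frames += ceil_div(chunk, frame_payload_bytes)
        remaining -= chunk
    return transfers, frames
```

Under these assumptions a 1 MB (1,048,576 byte) file takes 16 transfers of 43 frames each, or 688 frames in total, each carrying its own set of protocol headers.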
1 When a message packet or frame is received 47 from a network by the CPD, it is first

2 validated by a hardware assist. This includes determining the protocol types of the various 

3 layers, verifying relevant checksums, and summarizing 57 these findings into a status word or 

4 words. Included in these words is an indication whether or not the frame is a candidate for 

5 fast-path data flow. Selection 59 of fast-path candidates is based on whether the host may 

6 benefit from this message connection being handled by the CPD, which includes determining 

7 whether the packet has header bytes indicating particular protocols, such as TCP/IP or 

8 SPX/IPX for example. The small percent of frames that are not fast-path candidates are sent

9 61 to the host protocol stacks for slow-path protocol processing. Subsequent network

10 microprocessor work with each fast-path candidate determines whether a fast-path connection

11 such as a TCP or SPX CCB is already extant for that candidate, or whether that candidate may

12 be used to set up a new fast-path connection, such as for a TTCP/IP transaction. The 

13 validation provided by the CPD provides acceleration whether a frame is processed by the fast- 

14 path or a slow-path, as only error free, validated frames are processed by the host CPU even 

15 for the slow-path processing. 

16 All received message frames which have been determined by the CPD hardware assist to be

17 fast-path candidates are examined 53 by the network microprocessor or INIC comparator

18 circuits to determine whether they match a CCB held by the CPD. Upon confirming such a

19 match, the CPD removes lower layer headers and sends 69 the remaining application data from

20 the frame directly into its final destination in the host using direct memory access (DMA) units

21 of the CPD. This operation may occur immediately upon receipt of a message packet, for

22 example when a TCP connection already exists and destination buffers have been negotiated,

23 or it may first be necessary to process an initial header to acquire a new set of final destination

24 addresses for this transfer. In this latter case, the CPD will queue subsequent message packets

25 while waiting for the destination address, and then DMA the queued application data to that

26 destination. 

27 A fast-path candidate that does not match a CCB may be used to set up a new fast-path 

28 connection, by sending 65 the frame to the host for sequential protocol processing. In this 

29 case, the host uses this frame to create 51 a CCB, which is then passed to the CPD to control

30 subsequent frames on that connection. The CCB, which is cached 67 in the CPD, includes 

31 control and state information pertinent to all protocols that would have been processed had 

32 conventional software layer processing been employed. The CCB also contains storage space 
1 for per-transfer information used to facilitate moving application-level data contained within

2 subsequent related message packets directly to a host application in a form available for

3 immediate usage. The CPD takes command of connection processing upon receiving a CCB 

4 for that connection from the host. 
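
The receive decisions of FIG. 3 can be summarized in a short sketch. It assumes the frame has already been validated and summarized into a status word, here modelled as a dictionary with hypothetical keys `protocol` and `connection`; the function names and return strings are likewise illustrative and do not come from the disclosure.

```python
# Protocols named in the text as fast-path candidates.
FAST_PATH_PROTOCOLS = {"TCP/IP", "TTCP/IP", "SPX/IPX"}


def route_frame(status: dict, ccb_cache: dict) -> str:
    """Decide how a validated frame is handled."""
    if status["protocol"] not in FAST_PATH_PROTOCOLS:
        return "slow_path"             # not a candidate: host protocol stack
    if status["connection"] in ccb_cache:
        return "fast_path"             # CCB match: DMA data to its destination
    return "slow_path_create_ccb"      # host may build a CCB from this frame


def install_ccb(ccb_cache: dict, connection, ccb) -> None:
    """Host hands a CCB to the CPD, which caches it for later frames."""
    ccb_cache[connection] = ccb
```

The first frame of a connection therefore takes the slow path, and frames that arrive after `install_ccb` has been called for that connection take the fast path.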

5 As shown more specifically in FIG. 4A, when a message packet is received from the remote 

6 host 22 via network 25, the packet enters hardware receive logic 32 of the CPD 30, which 

7 checksums headers and data, and parses the headers, creating a word or words which identify 

8 the message packet and status, storing the headers, data and word temporarily in memory 60. 

9 As well as validating the packet, the receive logic 32 indicates with the word whether this 

10 packet is a candidate for fast-path processing. FIG. 4A depicts the case in which the packet is 

11 not a fast-path candidate, in which case the CPD 30 sends the validated headers and data from

12 memory 60 to data link layer 36 along an internal bus for processing by the host CPU, as 

13 shown by arrow 56. The packet is processed by the host protocol stack 44 of data link 36, 

14 network 38, transport 40 and session 42 layers, and data (D) 63 from the packet may then be 

15 sent to storage 35, as shown by arrow 65. 

16 FIG. 4B depicts the case in which the receive logic 32 of the CPD determines that a

17 message packet is a candidate for fast-path processing, for example by deriving from the

18 packet's headers that the packet belongs to a TCP/IP, TTCP/IP or SPX/IPX message. A

19 processor 55 in the CPD 30 then checks to see whether the word that summarizes the fast-path

20 candidate matches a CCB held in a cache 62. Upon finding no match for this packet, the CPD 

21 sends the validated packet from memory 60 to the host protocol stack 44 for processing. Host 

22 stack 44 may use this packet to create a connection context for the message, including finding 

23 and reserving a destination for data from the message associated with the packet, the context 

24 taking the form of a CCB. The present embodiment employs a single specialized host stack 44 

25 for processing both fast-path and non-fast-path candidates, while in an embodiment described 

26 below fast-path candidates are processed by a different host stack than non-fast-path

27 candidates. Some data (D1) 66 from that initial packet may optionally be sent to the

28 destination in storage 35, as shown by arrow 68. The CCB is then sent to the CPD 30 to be 

29 saved in cache 62, as shown by arrow 64. For a traditional connection-based message such as 

30 typified by TCP/IP, the initial packet may be part of a connection initialization dialogue that 

31 transpires between hosts before the CCB is created and passed to the CPD 30. 
1 Referring now to FIG. 4C, when a subsequent packet from the same connection as the

2 initial packet is received from the network 25 by CPD 30, the packet headers and data are 

3 validated by the receive logic 32, and the headers are parsed to create a summary of the 

4 message packet and a hash for finding a corresponding CCB, the summary and hash contained 

5 in a word or words. The word or words are temporarily stored in memory 60 along with the 

6 packet. The processor 55 checks for a match between the hash and each CCB that is stored in 

7 the cache 62 and, finding a match, sends the data (D2) 70 via a fast-path directly to the 

8 destination in storage 35, as shown by arrow 72, bypassing the session layer 42, transport layer 

9 40, network layer 38 and data link layer 36. The remaining data packets from the message can 

10 also be sent by DMA directly to storage, avoiding the relatively slow protocol layer processing 

11 and repeated copying by the CPU stack 44. 
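
The receive-side decision described for FIGs. 4B and 4C can be sketched as follows. This is only a minimal illustration under stated assumptions, not the microcode of the CD Appendix: the names `Ccb`, `CcbCache`, `ConnKey` and `route_packet` are hypothetical, a Python dictionary stands in for the hash lookup in cache 62, and a `bytearray` stands in for the destination in storage 35.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical connection key: (source IP, source port, destination IP, destination port).
ConnKey = Tuple[str, int, str, int]


@dataclass
class Ccb:
    """Illustrative stand-in for a communication control block (CCB)."""
    key: ConnKey
    # Stands in for the destination reserved in storage 35.
    destination: bytearray = field(default_factory=bytearray)


class CcbCache:
    """Stand-in for cache 62; a dict plays the role of the hash lookup."""

    def __init__(self) -> None:
        self._by_key: Dict[ConnKey, Ccb] = {}

    def add(self, ccb: Ccb) -> None:
        self._by_key[ccb.key] = ccb

    def match(self, key: ConnKey) -> Optional[Ccb]:
        return self._by_key.get(key)


def route_packet(cache: CcbCache, key: ConnKey, payload: bytes) -> str:
    """Return 'fast' if the payload went straight to its destination, else 'slow'."""
    ccb = cache.match(key)
    if ccb is None:
        # No matching CCB: the validated packet would go to the host protocol stack.
        return "slow"
    # Matching CCB: the payload is placed directly in the destination
    # (a DMA transfer in the hardware described above).
    ccb.destination.extend(payload)
    return "fast"
```

In this sketch the first packet of a connection takes the slow path because no CCB exists yet; once the host has created a CCB and added it to the cache, later packets with the same key take the fast path.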

12 FIG. 4D shows the procedure for handling the rare instance when a message for which a 

13 fast-path connection has been established, such as shown in FIG. 4C, has a packet that is not 

14 easily handled by the CPD. In this case the packet is sent to be processed by the protocol stack 

15 44, which is handed the CCB for that message from cache 62 via a control dialogue with the 

16 CPD, as shown by arrow 76, signaling to the CPU to take over processing of that message. 

17 Slow-path processing by the protocol stack then results in data (D3) 80 from the packet being 

18 sent, as shown by arrow 82, to storage 35. Once the packet has been processed and the error 

19 situation corrected, the CCB can be handed back via a control dialogue to the cache 62, so that 

20 payload data from subsequent packets of that message can again be sent via the fast-path of the 

21 CPD 30. Thus the CPU and CPD together decide whether a given message is to be processed 

22 according to fast-path hardware processing or more conventional software processing by the 

23 CPU. 
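
The hand-off of a CCB between the CPD and the host stack in FIG. 4D can be pictured as a small ownership state machine. The sketch below is an assumption-laden illustration: the class `FastPathConnection`, the `Owner` enumeration and the `needs_host` flag are hypothetical names, and the control dialogue is reduced to a change of owner.

```python
from enum import Enum
from typing import List, Tuple


class Owner(Enum):
    CPD = "CPD"          # CCB held in the CPD cache; fast-path is available
    HOST = "host stack"  # CCB handed to the protocol stack; slow-path only


class FastPathConnection:
    """Tracks which side currently owns the CCB for one message."""

    def __init__(self) -> None:
        self.ccb_owner = Owner.CPD
        self.delivered: List[Tuple[str, bytes]] = []

    def receive(self, payload: bytes, needs_host: bool = False) -> str:
        """Deliver one packet and return the path it took."""
        if needs_host and self.ccb_owner is Owner.CPD:
            # Packet not easily handled by the CPD: hand the CCB to the host.
            self.ccb_owner = Owner.HOST
        path = "fast" if self.ccb_owner is Owner.CPD else "slow"
        self.delivered.append((path, payload))
        return path

    def hand_back(self) -> None:
        """Host returns the CCB to the cache once the situation is corrected."""
        self.ccb_owner = Owner.CPD
```

While the host owns the CCB every packet of that message is processed by the slow path; after `hand_back()` the fast path resumes.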

24 Transmission of a message from the host 20 to the network 25 for delivery to remote host 22 

25 also can be processed by either sequential protocol software processing via the CPU or 

26 accelerated hardware processing via the CPD 30, as shown in FIG. 5. A message (M) 90 that 

27 is selected by CPU 28 from storage 35 can be sent to session layer 42 for processing by stack 

28 44, as shown by arrows 92 and 96. For the situation in which a connection exists and the CPD 

29 30 already has an appropriate CCB for the message, however, data packets can bypass host 

30 stack 44 and be sent by DMA directly to memory 60, with the processor 55 adding to each 

31 data packet a single header containing all the appropriate protocol layers, and sending the 

32 resulting packets to the network 25 for transmission to remote host 22. This fast-path 

1 transmission can greatly accelerate processing for even a single packet, with the acceleration 

2 multiplied for a larger message. 

3 A message for which a fast-path connection is not extant thus may benefit from creation of 

4 a CCB with appropriate control and state information for guiding fast-path transmission. For a 

5 traditional connection-based message, such as typified by TCP/IP or SPX/IPX, the CCB is 

6 created during connection initialization dialogue. For a quick-connection message, such as 

7 typified by TTCP/IP, the CCB can be created with the same transaction that transmits payload 

8 data. In this case, the transmission of payload data may be a reply to a request that was used to 

9 set up the fast-path connection. In any case, the CCB provides protocol and status information 

10 regarding each of the protocol layers, including which user is involved and storage space for 

11 per-transfer information. The CCB is created by protocol stack 44, which then passes the CCB 

12 to the CPD 30 by writing to a command register of the CPD, as shown by arrow 98. Guided 

13 by the CCB, the processor 55 moves network frame-sized portions of the data from the source 

14 in host memory 35 into its own memory 60 using DMA, as depicted by arrow 99. The 

15 processor 55 then prepends appropriate headers and checksums to the data portions, and 

16 transmits the resulting frames to the network 25, consistent with the restrictions of the 

17 associated protocols. After the CPD 30 has received an acknowledgement that all the data has 

18 reached its destination, the CPD will then notify the host 35 by writing to a response buffer. 

19 Thus, fast-path transmission of data communications also relieves the host CPU of per-frame 

20 processing. A vast majority of data transmissions can be sent to the network by the fast-path. 

21 Both the input and output fast-paths attain a huge reduction in interrupts by functioning at an 

22 upper layer level, i.e., session level or higher, and interactions between the network 

23 microprocessor and the host occur using the full transfer sizes which that upper layer wishes to 

24 make. For fast-path communications, an interrupt only occurs (at the most) at the beginning 

25 and end of an entire upper-layer message transaction, and there are no interrupts for the 

26 sending or receiving of each lower layer portion or packet of that transaction. 

27 A simplified intelligent network interface card (INIC) 150 is shown in FIG. 6 to provide a 

28 network interface for a host 152. Hardware logic 171 of the INIC 150 is connected to a 

29 network 155, with a peripheral bus (PCI) 157 connecting the INIC and host. The host 152 in 

30 this embodiment has a TCP/IP protocol stack, which provides a slow-path 158 for sequential 

31 software processing of message frames received from the network 155. The host 152 protocol 

32 stack includes a data link layer 160, network layer 162, a transport layer 164 and an 

1 application layer 166, which provides a source or destination 168 for the communication data 

2 in the host 152. Other layers which are not shown, such as session and presentation layers, 

3 may also be included in the host stack 152, and the source or destination may vary depending 

4 upon the nature of the data and may actually be the application layer. 

5 The INIC 150 has a network processor 170 which chooses between processing messages 

6 along a slow-path 158 that includes the protocol stack of the host, or along a fast-path 159 that 

7 bypasses the protocol stack of the host. Each received packet is processed on the fly by 

8 hardware logic 171 contained in INIC 150, so that all of the protocol headers for a packet can 

9 be processed without copying, moving or storing the data between protocol layers. The 

10 hardware logic 171 processes the headers of a given packet at one time as packet bytes pass 

11 through the hardware, by categorizing selected header bytes. Results of processing the 

12 selected bytes help to determine which other bytes of the packet are categorized, until a 

13 summary of the packet has been created, including checksum validations. The processed 

14 headers and data from the received packet are then stored in INIC storage 185, as well as the 

15 word or words summarizing the headers and status of the packet. For a network storage 

16 configuration, the INIC 150 may be connected to a peripheral storage device such as a disk 

17 drive which has an IDE, SCSI or similar interface, with a file cache for the storage device 

18 residing on the memory 185 of the INIC 150. Several such network interfaces may exist for a 

19 host, with each interface having an associated storage device. 

20 The hardware processing of message packets received by INIC 150 from network 155 is 

21 shown in more detail in FIG. 7. A received message packet first enters a media access 

22 controller 172, which controls INIC access to the network and receipt of packets and can 

23 provide statistical information for network protocol management. From there, data flows one 

24 byte at a time into an assembly register 174, which in this example is 128 bits wide. The data 

25 is categorized by a fly-by sequencer 178, as will be explained in more detail with regard to 

26 FIG. 8, which examines the bytes of a packet as they fly by, and generates status from those 

27 bytes that will be used to summarize the packet. The status thus created is merged with the 

28 data by a multiplexor 180 and the resulting data stored in SRAM 182. A packet control 

29 sequencer 176 oversees the fly-by sequencer 178, examines information from the media access 

30 controller 172, counts the bytes of data, generates addresses, moves status and manages the 

31 movement of data from the assembly register 174 to SRAM 182 and eventually DRAM 188. 

32 The packet control sequencer 176 manages a buffer in SRAM 182 via SRAM controller 183, 

1 and also indicates to a DRAM controller 186 when data needs to be moved from SRAM 182 to 

2 a buffer in DRAM 188. Once data movement for the packet has been completed and all the 

3 data has been moved to the buffer in DRAM 188, the packet control sequencer 176 will move 

4 the status that has been generated in the fly-by sequencer 178 out to the SRAM 182 and to the 

5 beginning of the DRAM 188 buffer to be prepended to the packet data. The packet control 

6 sequencer 176 then requests a queue manager 184 to enter a receive buffer descriptor into a 

7 receive queue, which in turn notifies the processor 170 that the packet has been processed by 

8 hardware logic 171 and its status summarized. 

9 FIG. 8 shows that the fly-by sequencer 178 has several tiers, with each tier generally 

10 focusing on a particular portion of the packet header and thus on a particular protocol layer, for 

11 generating status pertaining to that layer. The fly-by sequencer 178 in this embodiment 

12 includes a media access control sequencer 191, a network sequencer 192, a transport sequencer 

13 194 and a session sequencer 195. Sequencers pertaining to higher protocol layers can 

14 additionally be provided. The fly-by sequencer 178 is reset by the packet control sequencer 

15 176 and given pointers by the packet control sequencer that tell the fly-by sequencer whether a 

16 given byte is available from the assembly register 174. The media access control sequencer 

17 191 determines, by looking at bytes 0-5, that a packet is addressed to host 152 rather than or in 

18 addition to another host. Offsets 12 and 13 of the packet are also processed by the media 

19 access control sequencer 191 to determine the type field, for example whether the packet is 

20 Ethernet or 802.3. If the type field is Ethernet, those bytes also tell the media access control 

21 sequencer 191 the packet's network protocol type. For the 802.3 case, those bytes instead 

22 indicate the length of the entire frame, and the media access control sequencer 191 will check 

23 eight bytes further into the packet to determine the network layer type. 
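
The type/length test performed by the media access control sequencer can be sketched as below. The sketch assumes, beyond what the text states, that 802.3 frames carry an LLC/SNAP header, which is what places the network type eight bytes past the length field (offsets 20 and 21) and the network header at byte 22. The threshold 0x0600 separating length values from type values comes from the IEEE 802.3 convention rather than from this document, and the function name `classify_mac_header` is hypothetical.

```python
import struct
from typing import Tuple

ETHERTYPE_MIN = 0x0600   # values at or above this are type codes (IEEE 802.3 convention)
MAX_8023_LENGTH = 1500   # values at or below this are frame lengths


def classify_mac_header(frame: bytes) -> Tuple[str, int, int]:
    """Return (framing, network_type, offset_of_network_header)."""
    if len(frame) < 14:
        raise ValueError("frame too short for a MAC header")
    # Offsets 12 and 13 hold either a type code or a length.
    (type_or_len,) = struct.unpack_from("!H", frame, 12)
    if type_or_len >= ETHERTYPE_MIN:
        # Ethernet framing: the same two bytes give the network protocol type.
        return ("ethernet", type_or_len, 14)
    if type_or_len > MAX_8023_LENGTH:
        raise ValueError("type/length field is neither a length nor a type")
    if len(frame) < 22:
        raise ValueError("frame too short for an LLC/SNAP header")
    # 802.3 framing (LLC/SNAP assumed): look eight bytes further in.
    (network_type,) = struct.unpack_from("!H", frame, 20)
    return ("802.3", network_type, 22)
```

For an IP packet (type 0x0800) this yields a network header offset of 14 for Ethernet framing and 22 for 802.3 framing, consistent with the offsets given in the surrounding text.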

24 For most packets the network sequencer 192 validates that the header received has 

25 the correct length, and checksums the network layer header. For fast-path candidates the 

26 network layer header is known to be IP or IPX from analysis done by the media access control 

27 sequencer 191. Assuming for example that the type field is 802.3 and the network protocol is 

28 IP, the network sequencer 192 analyzes the first bytes of the network layer header, which will 

29 begin at byte 22, in order to determine IP type. The first bytes of the IP header will be 

30 processed by the network sequencer 192 to determine what IP type the packet involves. 

31 Determining that the packet involves, for example, IP version 4, directs further processing by 

32 the network sequencer 192, which also looks at the protocol type located ten bytes into the IP 

1 header for an indication of the transport header protocol of the packet. For example, for IP 

2 over Ethernet, the IP header begins at offset 14, and the protocol type byte is offset 23, which 

3 will be processed by network logic to determine whether the transport layer protocol is TCP, 

4 for example. From the length of the network layer header, which is typically 20-40 bytes, 

5 network sequencer 192 determines the beginning of the packet's transport layer header for 

6 validating the transport layer header. Transport sequencer 194 may generate checksums for 

7 the transport layer header and data, which may include information from the IP header in the 

8 case of TCP at least. 
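
The network-layer steps described above (determining the IP version, validating and checksumming the header, reading the protocol byte, and locating the transport header) can be sketched for IP version 4 as follows. This is a software illustration only, not the hardware sequencer; the names `internet_checksum`, `Ipv4Summary` and `parse_ipv4` are hypothetical, and the header layout used is the standard IPv4 one (version and header length in the first byte, protocol in the tenth byte, that is, at offset 9).

```python
from typing import NamedTuple


def internet_checksum(data: bytes) -> int:
    """Ones-complement sum of 16-bit words; 0 for a header with a valid checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return (~total) & 0xFFFF


class Ipv4Summary(NamedTuple):
    header_len: int        # bytes, typically 20
    protocol: int          # 6 means TCP
    transport_offset: int  # where the transport header begins in the frame
    checksum_ok: bool


def parse_ipv4(frame: bytes, ip_offset: int) -> Ipv4Summary:
    """Summarize the IPv4 header that starts at ip_offset within frame."""
    if len(frame) < ip_offset + 20:
        raise ValueError("frame too short for an IPv4 header")
    version = frame[ip_offset] >> 4
    header_len = (frame[ip_offset] & 0x0F) * 4
    if version != 4:
        raise ValueError("not IP version 4")
    if header_len < 20 or len(frame) < ip_offset + header_len:
        raise ValueError("invalid IPv4 header length")
    header = frame[ip_offset:ip_offset + header_len]
    protocol = frame[ip_offset + 9]  # tenth byte of the IP header
    return Ipv4Summary(header_len, protocol, ip_offset + header_len,
                       internet_checksum(header) == 0)
```

With Ethernet framing (`ip_offset` of 14) the protocol byte read here is frame offset 23, matching the example in the text.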

9 Continuing with the example of a TCP packet, transport sequencer 194 also analyzes the 

10 first few bytes in the transport layer portion of the header to determine, in part, the TCP source 

11 and destination ports for the message, such as whether the packet is NetBios or other 

12 protocols. Byte 12 of the TCP header is processed by the transport sequencer 194 to determine 

13 and validate the TCP header length. Byte 13 of the TCP header contains flags that may, aside 

14 from ack flags and push flags, indicate unexpected options, such as reset and fin, that may 

15 cause the processor to categorize this packet as an exception. TCP offset bytes 16 and 17 are 

16 the checksum, which is pulled out and stored by the hardware logic 171 while the rest of the 

17 frame is validated against the checksum. 
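
The TCP fields named above can be pulled from a segment as in the following sketch. It assumes the standard TCP header layout; the names `TcpSummary` and `summarize_tcp` are hypothetical, and treating every flag other than ack and push as an exception is a simplification of the categorization described in the text.

```python
import struct
from typing import NamedTuple

TCP_FIN = 0x01
TCP_SYN = 0x02
TCP_RST = 0x04
TCP_PSH = 0x08
TCP_ACK = 0x10
EXPECTED_FLAGS = TCP_ACK | TCP_PSH  # flags that do not by themselves signal an exception


class TcpSummary(NamedTuple):
    src_port: int
    dst_port: int
    header_len: int
    flags: int
    checksum: int
    is_exception: bool


def summarize_tcp(segment: bytes) -> TcpSummary:
    """Summarize a TCP segment that begins at the start of its TCP header."""
    if len(segment) < 20:
        raise ValueError("segment too short for a TCP header")
    src_port, dst_port = struct.unpack_from("!HH", segment, 0)
    # Byte 12: the upper four bits give the header length in 32-bit words.
    header_len = (segment[12] >> 4) * 4
    if header_len < 20 or header_len > len(segment):
        raise ValueError("invalid TCP header length")
    # Byte 13: flag bits.
    flags = segment[13] & 0x3F
    # Bytes 16 and 17: checksum, set aside while the rest is validated.
    (checksum,) = struct.unpack_from("!H", segment, 16)
    is_exception = bool(flags & ~EXPECTED_FLAGS)
    return TcpSummary(src_port, dst_port, header_len, flags, checksum, is_exception)
```

A segment carrying only ack and push flags is summarized as a normal candidate, while one carrying reset or fin is marked as an exception for the processor to handle.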

18 Session sequencer 195 determines the length of the session layer header, which in the case 

19 of NetBios is only four bytes, two of which tell the length of the NetBios payload data, but 

20 which can be much larger for other protocols. The session sequencer 195 can also be used to 

21 categorize the type of message as read or write, for example, for which the fast-path may be 

22 particularly beneficial. Further upper layer logic processing, depending upon the message 

23 type, can be performed by the hardware logic 171 of packet control sequencer 176 and fly-by 

24 sequencer 178. Thus hardware logic 171 intelligently directs hardware processing of the 

25 headers by categorization of selected bytes from a single stream of bytes, with the status of the 

26 packet being built from classifications determined on the fly. Once the packet control 

27 sequencer 176 detects that all of the packet has been processed by the fly-by sequencer 178, 

28 the packet control sequencer 176 adds the status information generated by the fly-by sequencer 

29 178 and any status information generated by the packet control sequencer 176, and prepends 

30 (adds to the front) that status information to the packet, for convenience in handling the packet 

31 by the processor 170. The additional status information generated by the packet control 

32 sequencer 176 includes media access controller 172 status information and any errors 

1 discovered, or data overflow in either the assembly register or DRAM buffer, or other 

2 miscellaneous information regarding the packet. The packet control sequencer 176 also stores 

3 entries into a receive buffer queue and a receive statistics queue via the queue manager 184. 

4 An advantage of processing a packet by hardware logic 171 is that the packet does not, in 

5 contrast with conventional sequential software protocol processing, have to be stored, moved, 

6 copied or pulled from storage for processing each protocol layer header, offering dramatic 

7 increases in processing efficiency and savings in processing time for each packet. The packets 

8 can be processed at the rate bits are received from the network, for example 100 

9 megabits/second for a 100 baseT connection. The time for categorizing a packet received at 

10 this rate and having a length of sixty bytes is thus about 5 microseconds. The total time for 

11 processing this packet with the hardware logic 171 and sending packet data to its host 

12 destination via the fast-path may be about 16 microseconds or less, assuming a 66 MHz PCI 

13 bus, whereas conventional software protocol processing by a 300 MHz Pentium II® processor 

14 may take as much as 200 microseconds in a busy device. More than an order of magnitude 

15 decrease in processing time can thus be achieved with fast-path 159 in comparison with a 

16 high-speed CPU employing conventional sequential software protocol processing, 

17 demonstrating the dramatic acceleration provided by processing the protocol headers by the 

18 hardware logic 171 and processor 170, without even considering the additional time savings 

19 afforded by the reduction in CPU interrupts and host bus bandwidth savings. 
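
The figures quoted above can be checked with a short calculation. The function names `wire_time_us` and `speedup` are hypothetical; the inputs are the sixty-byte packet, the 100 megabit/second line rate, and the 16 and 200 microsecond processing times given in the text.

```python
def wire_time_us(length_bytes: int, bits_per_second: float) -> float:
    """Time in microseconds for a packet of the given length to arrive at line rate."""
    return length_bytes * 8 / bits_per_second * 1e6


def speedup(slow_path_us: float, fast_path_us: float) -> float:
    """Ratio of slow-path to fast-path processing time."""
    return slow_path_us / fast_path_us


# Sixty bytes at 100 megabits/second: 480 bits / 100e6 bits/s = 4.8 microseconds,
# which the text rounds to about 5 microseconds.
categorize_us = wire_time_us(60, 100e6)

# 200 microseconds versus 16 microseconds: a factor of 12.5,
# that is, more than an order of magnitude.
ratio = speedup(200.0, 16.0)
```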

20 The processor 170 chooses, for each received message packet held in storage 185, whether 

21 that packet is a candidate for the fast-path 159 and, if so, checks to see whether a fast-path has 

22 already been set up for the connection that the packet belongs to. To do this, the processor 170 

23 first checks the header status summary to determine whether the packet headers are of a 

24 protocol defined for fast-path candidates. If not, the processor 170 commands DMA 

25 controllers in the INIC 150 to send the packet to the host for slow-path 158 processing. Even 

26 for a slow-path 158 processing of a message, the INIC 150 thus performs initial procedures 

27 such as validation and determination of message type, and passes the validated message at 

28 least to the data link layer 160 of the host. 

29 For fast-path 159 candidates, the processor 170 checks to see whether the header status 

30 summary matches a CCB held by the INIC. If so, the data from the packet is sent along fast- 

31 path 159 to the destination 168 in the host. If the fast-path 159 candidate's packet summary 

32 does not match a CCB held by the INIC, the packet may be sent to the host 152 for slow-path 

1 processing to create a CCB for the message. Employment of the fast-path 159 may also not be 

2 needed or desirable for the case of fragmented messages or other complexities. For the vast 

3 majority of messages, however, the INIC fast-path 159 can greatly accelerate message 

4 processing. The INIC 150 thus provides a single state machine processor 170 that decides 

5 whether to send data directly to its destination, based upon information gleaned on the fly, as 

6 opposed to the conventional employment of a state machine in each of several protocol layers 

7 for determining the destiny of a given packet. 

8 In processing an indication or packet received at the host 152, a protocol driver of the host 

9 selects the processing route based upon whether the indication is fast-path or slow-path. A 

10 TCP/IP or SPX/IPX message has a connection that is set up from which a CCB is formed by 

11 the driver and passed to the INIC for matching with and guiding the fast-path packet to the 

12 connection destination 168. For a TTCP/IP message, the driver can create a connection 

13 context for the transaction from processing an initial request packet, including locating the 

14 message destination 168, and then passing that context to the INIC in the form of a CCB for 

15 providing a fast-path for a reply from that destination. A CCB includes connection and state 

16 information regarding the protocol layers and packets of the message. Thus a CCB can 

17 include source and destination media access control (MAC) addresses, source and destination 

18 IP or IPX addresses, source and destination TCP or SPX ports, TCP variables such as timers, 

19 receive and transmit windows for sliding window protocols, and information indicating the 

20 session layer protocol. 

21 Caching the CCBs in a hash table in the INIC provides quick comparisons with words 

22 summarizing incoming packets to determine whether the packets can be processed via the fast- 

23 path 159, while the full CCBs are also held in the INIC for processing. Other ways to 

24 accelerate this comparison include software processes such as a B-tree or hardware assists 

25 such as a content addressable memory (CAM). When INIC microcode or comparator circuits 

26 detect a match with the CCB, a DMA controller places the data from the packet in the 

27 destination 168, without any interrupt by the CPU, protocol processing or copying. Depending 

28 upon the type of message received, the destination of the data may be the session, presentation 

29 or application layers, or a file buffer cache in the host 152. 
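
One way to picture a CCB and the hash-table comparison described above is the following sketch. Every name here (`ConnectionId`, `Ccb`, `CcbHashTable`, `hash8`) is hypothetical, the one-byte XOR hash is chosen only for illustration, and the CCB carries just a subset of the fields listed in the text. The point shown is that a short hash selects a bucket quickly while a full comparison against the held CCB decides the match.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ConnectionId:
    src_ip: bytes   # 4 bytes
    dst_ip: bytes   # 4 bytes
    src_port: int
    dst_port: int

    def hash8(self) -> int:
        """Fold addresses and ports into one byte (illustrative hash only)."""
        h = 0
        for b in self.src_ip + self.dst_ip:
            h ^= b
        for port in (self.src_port, self.dst_port):
            h ^= (port >> 8) ^ (port & 0xFF)
        return h


@dataclass
class Ccb:
    conn: ConnectionId
    src_mac: bytes
    dst_mac: bytes
    receive_window: int = 0
    transmit_window: int = 0
    session_protocol: str = "NetBios"


class CcbHashTable:
    """Buckets of full CCBs indexed by the short hash of their connection."""

    def __init__(self) -> None:
        self._buckets: Dict[int, List[Ccb]] = {}

    def insert(self, ccb: Ccb) -> None:
        self._buckets.setdefault(ccb.conn.hash8(), []).append(ccb)

    def lookup(self, conn: ConnectionId) -> Optional[Ccb]:
        for ccb in self._buckets.get(conn.hash8(), []):
            if ccb.conn == conn:  # full comparison resolves hash collisions
                return ccb
        return None
```

Two different connections can share a hash value (for instance when source and destination ports are swapped under this XOR hash), which is why the sketch still compares the full connection identity before reporting a match.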

30 FIG. 9 shows an INIC 200 connected to a host 202 that is employed as a file server. This 

31 INIC provides a network interface for several network connections employing the 802.3u 

32 standard, commonly known as Fast Ethernet. The INIC 200 is connected by a PCI bus 205 to 

1 the server 202, which maintains a TCP/IP or SPX/IPX protocol stack including MAC layer 

2 212, network layer 215, transport layer 217 and application layer 220, with a 

3 source/destination 222 shown above the application layer, although as mentioned earlier the 

4 application layer can be the source or destination. The INIC is also connected to network lines 

5 210, 240, 242 and 244, which are preferably Fast Ethernet, twisted pair, fiber optic, coaxial 

6 cable or other lines each allowing data transmission of 100 Mb/s, while faster and slower data 

7 rates are also possible. Network lines 210, 240, 242 and 244 are each connected to a dedicated 

8 row of hardware circuits which can each validate and summarize message packets received 

9 from their respective network line. Thus line 210 is connected with a first horizontal row of 

10 sequencers 250, line 240 is connected with a second horizontal row of sequencers 260, line 

11 242 is connected with a third horizontal row of sequencers 262 and line 244 is connected with 

12 a fourth horizontal row of sequencers 264. After a packet has been validated and summarized 

13 by one of the horizontal hardware rows it is stored along with its status summary in storage 

14 270. 

15 A network processor 230 determines, based on that summary and a comparison with any 

16 CCBs stored in the INIC 200, whether to send a packet along a slow-path 231 for processing 

17 by the host. A large majority of packets can avoid such sequential processing and have their 

18 data portions sent by DMA along a fast-path 237 directly to the data destination 222 in the 

19 server according to a matching CCB. Similarly, the fast-path 237 provides an avenue to send 

20 data directly from the source 222 to any of the network lines by processor 230 division of the 

21 data into packets and addition of full headers for network transmission, again minimizing CPU 

22 processing and interrupts. For clarity only horizontal sequencer 250 is shown active; in 

23 actuality each of the sequencer rows 250, 260, 262 and 264 offers full duplex communication, 

24 concurrently with all other sequencer rows. The specialized INIC 200 is much faster at 

25 working with message packets than even advanced general-purpose host CPUs that process 

26 those headers sequentially according to the software protocol stack. 

27 One of the most commonly used network protocols for large messages such as file transfers 

28 is server message block (SMB) over TCP/IP. SMB can operate in conjunction with redirector 

29 software that determines whether a required resource for a particular operation, such as a 

30 printer or a disk upon which a file is to be written, resides in or is associated with the host from 

31 which the operation was generated or is located at another host connected to the network, such 

32 as a file server. SMB and server/redirector are conventionally serviced by the transport layer; 

1 in the present invention SMB and redirector can instead be serviced by the INIC. In this case, 

2 sending data by the DMA controllers from the INIC buffers when receiving a large SMB 

3 transaction may greatly reduce interrupts that the host must handle. Moreover, this DMA 

4 generally moves the data to its final destination in the file device cache. An SMB transmission 

5 of the present invention follows essentially the reverse of the above described SMB receive, 

6 with data transferred from the host to the INIC and stored in buffers, while the associated 

7 protocol headers are prepended to the data in the INIC, for transmission via a network line to a 

8 remote host. Processing by the INIC of the multiple packets and multiple TCP, IP, NetBios 

9 and SMB protocol layers via custom hardware and without repeated interrupts of the host can 

10 greatly increase the speed of transmitting an SMB message to a network line. 

11 As shown in FIG. 10, for controlling whether a given message is processed by the host 202 

12 or by the INIC 200, a message command driver 300 may be installed in host 202 to work in 

13 concert with a host protocol stack 310. The command driver 300 can intervene in message 

14 reception or transmittal, create CCBs and send or receive CCBs from the INIC 200, so that 

15 functioning of the INIC, aside from improved performance, is transparent to a user. Also 

16 shown are an INIC memory 304 and an INIC miniport driver 306, which can direct message 

17 packets received from network 210 to either the conventional protocol stack 310 or the 

18 command protocol stack 300, depending upon whether a packet has been labeled as a fast-path 

19 candidate. The conventional protocol stack 310 has a data link layer 312, a network layer 314 

20 and a transport layer 316 for conventional, lower layer processing of messages that are not 

21 labeled as fast-path candidates and therefore not processed by the command stack 300. 

22 Residing above the lower layer stack 310 is an upper layer 318, which represents a session, 

23 presentation and/or application layer, depending upon the message communicated. The 

24 command driver 300 similarly has a data link layer 320, a network layer 322 and a transport 

25 layer 325. 

26 The driver 300 includes an upper layer interface 330 that determines, for transmission of 

27 messages to the network 210, whether a message transmitted from the upper layer 318 is to be 

28 processed by the command stack 300 and subsequently the INIC fast-path, or by the 

29 conventional stack 310. When the upper layer interface 330 receives an appropriate message 

30 from the upper layer 318 that would conventionally be intended for transmission to the 

31 network after protocol processing by the protocol stack of the host, the message is passed to 

32 driver 300. The INIC then acquires network-sized portions of the message data for that 


1 transmission via INIC DMA units, prepends headers to the data portions and sends the 

2 resulting message packets down the wire. Conversely, in receiving a TCP, TTCP, SPX or 

3 similar message packet from the network 210 to be used in setting up a fast-path connection, 

4 miniport driver 306 diverts that message packet to command driver 300 for processing. The 

5 driver 300 processes the message packet to create a context for that message, with the driver 

6 302 passing the context and command instructions back to the INIC 200 as a CCB for sending 

7 data of subsequent messages for the same connection along a fast-path. Hundreds of TCP, 

8 TTCP, SPX or similar CCB connections may be held indefinitely by the INIC, although a least 

9 recently used (LRU) algorithm is employed for the case when the INIC cache is full. The 

10 driver 300 can also create a connection context for a TTCP request which is passed to the INIC 

11 200 as a CCB, allowing fast-path transmission of a TTCP reply to the request. A message 

12 having a protocol that is not accelerated can be processed conventionally by protocol stack 

13 310. 
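
The least recently used (LRU) policy mentioned above for a full INIC cache can be sketched as follows. The class `LruCcbCache` and its methods are hypothetical names, and returning the evicted entry to the caller (so that it could be handed back to the host) is an assumption of this sketch rather than a behavior stated in the text.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional, Tuple


class LruCcbCache:
    """Fixed-capacity CCB cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the CCB for key, marking it most recently used."""
        ccb = self._entries.get(key)
        if ccb is not None:
            self._entries.move_to_end(key)
        return ccb

    def put(self, key: Hashable, ccb: Any) -> Optional[Tuple[Hashable, Any]]:
        """Store a CCB; return the evicted (key, ccb) pair, or None."""
        if key in self._entries:
            self._entries[key] = ccb
            self._entries.move_to_end(key)
            return None
        evicted = None
        if len(self._entries) >= self._capacity:
            # Oldest entry sits at the front of the ordered dict.
            evicted = self._entries.popitem(last=False)
        self._entries[key] = ccb
        return evicted
```

A connection whose CCB has been evicted would simply be processed by the slow path until a CCB is again passed to the INIC.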

14 FIG. 11 shows a TCP/IP implementation of command driver software for Microsoft® 

15 protocol messages. A conventional host protocol stack 350 includes MAC layer 353, IP layer 

16 355 and TCP layer 358. A command driver 360 works in concert with the host stack 350 to 

17 process network messages. The command driver 360 includes a MAC layer 363, an IP layer 

18 366 and an Alacritech TCP (ATCP) layer 373. The conventional stack 350 and command 

19 driver 360 share a network driver interface specification (NDIS) layer 375, which interacts 

20 with the INIC miniport driver 306. The INIC miniport driver 306 sorts receive indications 

21 for processing by either the conventional host stack 350 or the ATCP driver 360. A TDI filter 

22 driver and upper layer interface 380 similarly determines whether messages sent from a TDI 

23 user 382 to the network are diverted to the command driver and perhaps to the fast-path of the 

24 INIC, or processed by the host stack. 

25 FIG. 12 depicts a typical SMB exchange between a client 190 and server 290, both of 

26 which have communication devices of the present invention, the communication devices each 

27 holding a CCB defining their connection for fast-path movement of data. The client 190 

28 includes INIC 150, 802.3 compliant data link layer 160, IP layer 162, TCP layer 164, NetBios 

29 layer 166, and SMB layer 168. The client has a slow-path 157 and fast-path 159 for 

30 communication processing. Similarly, the server 290 includes INIC 200, 802.3 compliant data 

31 link layer 212, IP layer 215, TCP layer 217, NetBios layer 220, and SMB 222. The server is 

1 connected to network lines 240, 242 and 244, as well as line 210 which is connected to client 

2 190. The server also has a slow-path 231 and fast-path 237 for communication processing. 

3 Assuming that the client 190 wishes to read a 100 KB file on the server 290, the client may 

4 begin by sending a Read Block Raw (RBR) SMB command across network 210 requesting the 

5 first 64 KB of that file on the server 290. The RBR command may be only 76 bytes, for 

6 example, so the INIC 200 on the server will recognize the message type (SMB) and relatively 

7 small message size, and send the 76 bytes directly via the fast-path to NetBios of the server. 

8 NetBios will give the data to SMB, which processes the Read request and fetches the 64 KB of 

9 data into server data buffers. SMB then calls NetBios to send the data, and NetBios outputs 

10 the data for the client. In a conventional host, NetBios would call TCP output and pass 64 KB 

11 to TCP, which would divide the data into 1460 byte segments and output each segment via IP 

12 and eventually MAC (slow-path 231). In the present case, the 64 KB data goes to the ATCP 

13 driver along with an indication regarding the client-server SMB connection, which indicates a 

14 CCB held by the INIC. The INIC 200 then proceeds to DMA 1460 byte segments from the 

15 host buffers, add the appropriate headers for TCP, IP and MAC at one time, and send the 

16 completed packets on the network 210 (fast-path 237). The INIC 200 will repeat this until the 

17 whole 64 KB transfer has been sent. Usually after receiving acknowledgement from the client 

18 that the 64 KB has been received, the INIC will then send the remaining 36 KB also by the fast- 

19 path 237. 
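
The division of the 64 KB reply into 1460 byte segments, each given its headers in a single step, can be sketched as below. The sketch assumes 64 KB means 65,536 bytes and uses a fixed 54-byte placeholder (14 bytes MAC, 20 bytes IP, 20 bytes TCP) in place of real headers, which in practice differ per frame in sequence number, length and checksum. The names `segment` and `build_frames` are hypothetical.

```python
from typing import Iterator, List

MSS = 1460  # payload bytes per segment, as in the example in the text


def segment(data: bytes, mss: int = MSS) -> Iterator[bytes]:
    """Yield successive payload portions of at most mss bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    for offset in range(0, len(data), mss):
        yield data[offset:offset + mss]


def build_frames(data: bytes, header: bytes, mss: int = MSS) -> List[bytes]:
    """Prepend one combined header to each portion (placeholder header only)."""
    return [header + chunk for chunk in segment(data, mss)]


# Placeholder for the combined MAC + IP + TCP header added at one time.
PLACEHOLDER_HEADER = bytes(14 + 20 + 20)

frames = build_frames(bytes(64 * 1024), PLACEHOLDER_HEADER)
```

Under these assumptions the 64 KB transfer becomes 45 frames: 44 full segments of 1460 bytes and a final segment of 1296 bytes. In the fast-path described above, none of these frames requires per-frame processing by the host CPU.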

20 With INIC 150 operating on the client 190 when this reply arrives, the INIC 150 recognizes 

2 1 fi-om the first firame received that this connection is receiving fast-path 159 processing 

22 (TCP/IP, NetBios, matching a CCB), and tiie ATCP may use tiiis first fimne to acquire buffer 

23 Space for the message. This latter case is done by passing the first 128 bytes of the NetBios 

24 portion of the firame via the ATCP fast-path directiy to the host NetBios; that will give 

25 NetBios/SMB all of the firame's headers. NetBios/SMB will analyze these headers, realize by 

26 matching with a request ID that this is a reply to tiie original RawRead connection, and give 

27 tiie ATCP a 64K Ust of buffers into which to place tiie data. At tiiis stage only one firame has 

28 arrived, although more may arrive while this processing is occurring. As soon as the client 

29 buffer list is given to tiie ATCP, it passes tiiat transfer information to the INIC 1 50, and the 

30 DSflC 1 50 starts DMAing any frame data that has accumulated into those buffers. 

31 FIG. 13 provides a simplified diagram of the INIC 200, which combines the functions of a

32 network interface controller and a protocol processor in a single ASIC chip 400. The INIC

1 200 in this embodiment offers a full-duplex, four channel, 10/100-Megabit per second (Mbps)

2 intelligent network interface controller that is designed for high speed protocol processing for 

3 server applications. Although designed specifically for server applications, the INIC 200 can 

4 be connected to personal computers, workstations, routers or other hosts anywhere that 

5 TCP/IP, TTCP/IP or SPX/IPX protocols are being utilized. 

6 The INIC 200 is connected with four network lines 210, 240, 242 and 244, which may 

7 transport data along a number of different conduits, such as twisted pair, coaxial cable or 

8 optical fiber, each of the connections providing a media independent interface (MII) via

9 commercially available physical layer chips, such as model 80220/80221 Ethernet Media 

10 Interface Adapter from SEEQ Technology Incorporated, 47200 Bayside Parkway, Fremont, 

11 CA 94538. The lines preferably are 802.3 compliant and in connection with the INIC

12 constitute four complete Ethernet nodes, the INIC supporting 10Base-T, 10Base-T2, 100Base-

13 TX, 100Base-FX and 100Base-T4 as well as future interface standards. Physical layer

14 identification and initialization is accomplished through host driver initialization routines. The 

15 connection between the network lines 210, 240, 242 and 244 and the INIC 200 is controlled by 

16 MAC units MAC-A 402, MAC-B 404, MAC-C 406 and MAC-D 408 which contain logic 

17 circuits for performing the basic functions of the MAC sublayer, essentially controlling when 

18 the INIC accesses the network lines 210, 240, 242 and 244. The MAC units 402-408 may act 

19 in promiscuous, multicast or unicast modes, allowing the INIC to function as a network 

20 monitor, receive broadcast and multicast packets and implement multiple MAC addresses for 

21 each node. The MAC units 402-408 also provide statistical information that can be used for

22 simple network management protocol (SNMP). 

23 The MAC units 402, 404, 406 and 408 are each connected to a transmit and receive 

24 sequencer, XMT & RCV-A 418, XMT & RCV-B 420, XMT & RCV-C 422 and XMT & 

25 RCV-D 424, by wires 410, 412, 414 and 416, respectively. Each of the transmit and receive 

26 sequencers can perform several protocol processing steps on the fly as message frames pass 

27 through that sequencer. In combination with the MAC units, the transmit and receive 

28 sequencers 418-422 can compile the packet status for the data link, network, transport, session 

29 and, if appropriate, presentation and application layer protocols in hardware, greatly reducing 

30 the time for such protocol processing compared to conventional sequential software engines. 

31 The transmit and receive sequencers 410-414 are connected, by lines 426, 428, 430 and 432 to

32 an SRAM and DMA controller 444, which includes DMA controllers 438 and SRAM

1 controller 442. Static random access memory (SRAM) buffers 440 are coupled with SRAM

2 controller 442 by line 441. The SRAM and DMA controllers 444 interact across line 446 with

3 external memory control 450 to send and receive frames via external memory bus 455 to and

4 from dynamic random access memory (DRAM) buffers 460, which are located adjacent to the

5 IC chip 400. The DRAM buffers 460 may be configured as 4 MB, 8 MB, 16 MB or 32 MB,

6 and may optionally be disposed on the chip. The SRAM and DMA controllers 444 are 

7 connected via line 464 to a PCI Bus Interface Unit (BIU) 468, which manages the interface 

8 between the INIC 200 and the PCI interface bus 257. The 64-bit, multiplexed BIU 468

9 provides a direct interface to the PCI bus 257 for both slave and master functions. The INIC

10 200 is capable of operating in either a 64-bit or 32-bit PCI environment, while supporting 64-

11 bit addressing in either configuration.

12 A microprocessor 470 is connected by line 472 to the SRAM and DMA controllers 444, 

13 and connected via line 475 to the PCI BIU 468. Microprocessor 470 instructions and register 

14 files reside in an on chip control store 480, which includes a writable on-chip control store 

15 (WCS) of SRAM and a read only memory (ROM), and is connected to the microprocessor by 

16 line 477. The microprocessor 470 offers a programmable state machine which is capable of 

17 processing incoming frames, processing host commands, directing network traffic and 

18 directing PCI bus traffic. Three processors are implemented using shared hardware in a three

19 level pipelined architecture that launches and completes a single instruction for every clock 

20 cycle. A receive processor 482 is primarily used for receiving communications while a 

21 transmit processor 484 is primarily used for transmitting communications in order to facilitate 

22 full duplex communication, while a utility processor 486 offers various functions including

23 overseeing and controlling PCI register access.

24 The instructions for the three processors 482, 484 and 486 reside in the on-chip control- 

25 store 480. Thus the functions of the three processors can be easily redefined, so that the 

26 microprocessor 470 can be adapted for a given environment. For instance, the amount of

27 processing required for receive functions may outweigh that required for either transmit or

28 utility functions. In this situation, some receive functions may be performed by the transmit

29 processor 484 and/or the utility processor 486. Alternatively, an additional level of pipelining

30 can be created to yield four or more virtual processors instead of three, with the additional

31 level devoted to receive functions.

1 The INIC 200 in this embodiment can support up to 256 CCBs which are maintained in a

2 table in the DRAM 460. There is also, however, a CCB index in hash order in the SRAM 440 

3 to save sequential searching. Once a hash has been generated, the CCB is cached in SRAM, 

4 with up to sixteen cached CCBs in SRAM in this example. Allocation of the sixteen CCBs 

5 cached in SRAM is handled by a least recently used register, described below. These cache 

6 locations are shared between the transmit 484 and receive 486 processors so that the processor 

7 with the heavier load is able to use more cache buffers. There are also eight header buffers 

8 and eight command buffers to be shared between the sequencers. A given header or command

9 buffer is not statically linked to a specific CCB buffer, as the link is dynamic on a per-frame 

10 basis. 

11 FIG. 14 shows an overview of the pipelined microprocessor 470, in which instructions for

12 the receive, transmit and utility processors are executed in three alternating phases according 

13 to Clock increments I, II and III, the phases corresponding to each of the pipeline stages. Each 

14 phase is responsible for different functions, and each of the three processors occupies a 

15 different phase during each Clock increment. Each processor usually operates upon a different 

16 instruction stream from the control store 480, and each carries its own program counter and

17 status through each of the phases. 

18 In general, a first instruction phase 500 of the pipelined microprocessors completes an 

19 instruction and stores the result in a destination operand, fetches the next instruction, and 

20 stores that next instruction in an instruction register. A first register set 490 provides a number 

21 of registers including the instruction register, and a set of controls 492 for first register set 

22 provides the controls for storage to the first register set 490. Some items pass through the first 

23 phase without modification by the controls 492, and instead are simply copied into the first 

24 register set 490 or a RAM file register 533. A second instruction phase 560 has an instruction 

25 decoder and operand multiplexer 498 that generally decodes the instruction that was stored in 

26 the instruction register of the first register set 490 and gathers any operands which have been 

27 generated, which are then stored in a decode register of a second register set 496. The first 

28 register set 490, second register set 496 and a third register set 501, which is employed in a

29 third instruction phase 600, include many of the same registers, as will be seen in the more 

30 detailed views of FIGs. 15A-C. The instruction decoder and operand multiplexer 498 can read 

31 from two address and data ports of the RAM file register 533, which operates in both the first 

32 phase 500 and second phase 560. A third phase 600 of the processor 470 has an arithmetic

1 logic unit (ALU) 602 which generally performs any ALU operations on the operands from the

2 second register set, storing the results in a results register included in the third register set 501.

3 A stack exchange 608 can reorder register stacks, and a queue manager 503 can arrange 

4 queues for the processor 470, the results of which are stored in the third register set. 

5 The instructions continue with the first phase then following the third phase, as depicted by a 

6 circular pipeline 505. Note that various functions have been distributed across the three phases 

7 of the instruction execution in order to minimize the combinatorial delays within any given 

8 phase. With a frequency in this embodiment of 66 MHz, each Clock increment takes 15

9 nanoseconds to complete, for a total of 45 nanoseconds to complete one instruction for each of

10 the three processors. The rotating instruction phases are depicted in more detail in FIGs. 15A-

11 C, in which each phase is shown in a different figure.
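
The rotation of the three processors through the three phases, and the timing quoted above, can be sketched as follows. This is only an illustrative model. The starting phase of each processor (the offset used in `phase_for`) is an assumption, and the function names are hypothetical.

```python
CLOCK_NS = 15   # duration of one Clock increment, per the text (66 MHz)
PHASES = 3      # first, second and third instruction phases
PROCESSORS = ("receive", "transmit", "utility")


def phase_for(processor_index, clock):
    """Phase (0, 1 or 2) occupied by a processor at a given Clock increment."""
    # Assumed offsets: processor i starts in phase i and advances one phase per tick.
    return (processor_index + clock) % PHASES


def schedule(clock):
    """Map each processor name to the phase it occupies at this Clock increment."""
    return {name: phase_for(i, clock) for i, name in enumerate(PROCESSORS)}


# One instruction of one processor passes through all three phases.
instruction_latency_ns = PHASES * CLOCK_NS
```

In this model every Clock increment finds the three processors in three different phases, and the pattern repeats every three increments, which corresponds to the circular pipeline 505.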

12 More particularly, FIG. 15A shows some specific hardware functions of the first phase 500,

13 which generally includes the first register set 490 and related controls 492. The controls for the

14 first register set 492 include an SRAM control 502, which is a logical control for loading

15 address and write data into SRAM address and data registers 520. Thus the output of the ALU 

16 602 from the third phase 600 may be placed by SRAM control 502 into an address register or 

17 data register of SRAM address and data registers 520. A load control 504 similarly provides 

18 controls for writing a context for a file to file context register 522, and another load control 

19 506 provides controls for storing a variety of miscellaneous data to flip-flop registers 525. 

20 ALU condition codes, such as whether a carried bit is set, get clocked into ALU condition 

21 codes register 528 without an operation performed in the first phase 500. Flag decodes 508 

22 can perform various functions, such as setting locks, that get stored in flag registers 530.

23 The RAM file register 533 has a single write port for addresses and data and two read ports 

24 for addresses and data, so that more than one register can be read from at one time. As noted 

25 above, the RAM file register 533 essentially straddles the first and second phases, as it is 

26 written in the first phase 500 and read from in the second phase 560. A control store 

27 instruction 510 allows the reprogramming of the processors due to new data in from the 

28 control store 480, not shown in this figure, the instructions stored in an instruction register 

29 535. The address for this is generated in a fetch control register 511, which determines which 

30 address to fetch, the address stored in fetch address register 538. Load control 515 provides

31 instructions for a program counter 540, which operates much like the fetch address for the 

32 control store. A last-in first-out stack 544 of three registers is copied to the first register set 



1 without undergoing other operations in this phase. Finally, a load control 517 for a debug 

2 address 548 is optionally included, which allows correction of errors that may occur. 

3 FIG. 15B depicts the second microprocessor phase 560, which includes reading addresses 

4 and data out of the RAM file register 533. A scratch SRAM 565 is written from SRAM 

5 address and data register 520 of the first register set, which includes a register that passes 

6 through the first two phases to be incremented in the third. The scratch SRAM 565 is read by

7 the instruction decoder and operand multiplexer 498, as are most of the registers from the first

8 register set, with the exception of the stack 544, debug address 548 and SRAM address and

9 data register mentioned above. The instruction decoder and operand multiplexer 498 looks at

10 the various registers of set 490 and SRAM 565, decodes the instructions and gathers the

11 operands for operation in the next phase, in particular determining the operands to provide to

12 the ALU 602 below. The outcome of the instruction decoder and operand multiplexer 498 is

13 stored to a number of registers in the second register set 496, including ALU operands 579 and

14 582, ALU condition code register 580, and a queue channel and command 587 register, which

15 in this embodiment can control thirty-two queues. Several of the registers in set 496 are

16 loaded fairly directly from the instruction register 535 above without substantial decoding by

17 the decoder 498, including a program control 590, a literal field 589, a test select 584 and a

18 flag select 585. Other registers such as the file context 522 of the first phase 500 are always

19 stored in a file context 577 of the second phase 560, but may also be treated as an operand that

20 is gathered by the multiplexer 572. The stack registers 544 are simply copied in stack register 

21 594. The program counter 540 is incremented 568 in this phase and stored in register 592. 

22 Also incremented 570 is the optional debug address 548, and a load control 575 may be fed 

23 from the pipeline 505 at this point in order to allow error control in each phase, the result 

24 stored in debug address 598. 

25 FIG. 15C depicts the third microprocessor phase 600, which includes ALU and queue 

26 operations. The ALU 602 includes an adder, priority encoders and other standard logic 

27 functions. Results of the ALU are stored in registers ALU output 618, ALU condition codes

28 620 and destination operand results 622. A file context register 616, flag select register 626 

29 and literal field register 630 are simply copied from the previous phase 560. A test multiplexer 

30 604 is provided to determine whether a conditional jump results in a jump, with the results 

31 stored in a test results register 624. The test multiplexer 604 may instead be performed in the 

32 first phase 500 along with similar decisions such as fetch control 511. A stack exchange 608

1 shifts a stack up or down by fetching a program counter from stack 594 or putting a program

2 counter onto that stack, results of which are stored in program control 634, program counter 

3 638 and stack 640 registers. The SRAM address may optionally be incremented in this phase 

4 600. Another load control 610 for another debug address 642 may be forced from the pipeline

5 505 at this point in order to allow error control in this phase also. A QRAM & QALU 606, 

6 shown together in this figure, read from the queue channel and command register 587, store in 

7 SRAM and rearrange queues, adding or removing data and pointers as needed to manage the 

8 queues of data, sending results to the test multiplexer 604 and a queue flags and queue address 

9 register 628. Thus the QRAM & QALU 606 assume the duties of managing queues for the 
10 three processors, a task conventionally performed sequentially by software on a CPU, the

11 queue manager 606 instead providing accelerated and substantially parallel hardware queuing.

12 FIG. 16 depicts two of the thirty-two hardware queues that are managed by the queue

13 manager 606, with each of the queues having an SRAM head, an SRAM tail and the ability to 

14 queue information in a DRAM body as well, allowing expansion and individual configuration 

15 of each queue. Thus FIFO 700 has SRAM storage units 705, 707, 709 and 711, each

16 containing eight bytes for a total of thirty-two bytes, although the number and capacity of

17 these units may vary in other embodiments. Similarly, FIFO 702 has SRAM storage units

18 713, 715, 717 and 719. SRAM units 705 and 707 are the head of FIFO 700 and units 709 and

19 711 are the tail of that FIFO, while units 713 and 715 are the head of FIFO 702 and units 717

20 and 719 are the tail of that FIFO. Information for FIFO 700 may be written into head units

21 705 or 707, as shown by arrow 722, and read from tail units 711 or 709, as shown by arrow

22 725. A particular entry, however, may be both written to and read from head units 705 or 707,

23 or may be both written to and read from tail units 709 or 711, minimizing data movement and

24 latency. Similarly, information for FIFO 702 is typically written into head units 713 or 715, as 

25 shown by arrow 733, and read from tail units 717 or 719, as shown by arrow 739, but may 

26 instead be read from the same head or tail unit to which it was written. 

27 The SRAM FIFOS 700 and 702 are both connected to DRAM 460, which allows virtually 

28 unlimited expansion of those FIFOS to handle situations in which the SRAM head and tail are

29 full. For example a first of the thirty-two queues, labeled Q-zero, may queue an entry in

30 DRAM 460, as shown by arrow 727, by DMA units acting under direction of the queue

31 manager, instead of being queued in the head or tail of FIFO 700. Entries stored in DRAM

32 460 return to SRAM unit 709, as shown by arrow 730, extending the length and fall-through

1 time of that FIFO. Diversion from SRAM to DRAM is typically reserved for when the SRAM

2 is full, since DRAM is slower and DMA movement causes additional latency. Thus Q-zero 

3 may comprise the entries stored by queue manager 606 in both the FIFO 700 and the DRAM 

4 460. Likewise, information bound for FIFO 702, which may correspond to Q-twenty-seven, 

5 for example, can be moved by DMA into DRAM 460, as shown by arrow 735. The capacity 

6 for queuing in cost-effective albeit slower DRAM 460 is user-definable during initialization, 

7 allowing the queues to change in size as desired. Information queued in DRAM 460 is 

8 returned to SRAM unit 717, as shown by arrow 737. 
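
The behavior of a queue with an SRAM head, an SRAM tail and a DRAM body can be modeled in software as a rough sketch. This is not the hardware queue manager. The class and method names are hypothetical, each SRAM unit is assumed to hold one entry, and the two head and two tail slots follow the unit counts described for FIG. 16.

```python
from collections import deque


class SpillQueue:
    """FIFO with a small SRAM head and tail and an unbounded DRAM body."""

    def __init__(self, head_slots=2, tail_slots=2):
        self.head = deque()  # newest entries, held in SRAM
        self.body = deque()  # overflow entries, held in DRAM
        self.tail = deque()  # oldest entries, held in SRAM
        self.head_slots = head_slots
        self.tail_slots = tail_slots

    def put(self, entry):
        # Nothing is waiting behind the tail, so write the tail directly
        # and avoid any data movement.
        if not self.head and not self.body and len(self.tail) < self.tail_slots:
            self.tail.append(entry)
            return
        # The SRAM head is full: divert its entries to the DRAM body.
        if len(self.head) == self.head_slots:
            self.body.extend(self.head)
            self.head.clear()
        self.head.append(entry)

    def get(self):
        entry = self.tail.popleft()  # raises IndexError if the queue is empty
        # Refill the tail with the oldest waiting entry: DRAM body first, then head.
        source = self.body if self.body else self.head
        if source:
            self.tail.append(source.popleft())
        return entry


q = SpillQueue()
for n in range(10):
    q.put(n)
spilled = len(q.body)              # entries diverted to the DRAM body
drained = [q.get() for _ in range(10)]
```

The model keeps entries in SRAM whenever there is room and uses the DRAM body only when the head is full, which mirrors the stated preference for avoiding the slower DRAM path while preserving first-in first-out order.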

9 Status for each of the thirty-two hardware queues is conveniently maintained in and 
10 accessed from a set 740 of four, thirty-two bit registers, as shown in FIG. 17, in which a 

11 specific bit in each register corresponds to a specific queue. The registers are labeled Q-

12 Out_Ready 745, Q-In_Ready 750, Q-Empty 755 and Q-Full 760. If a particular bit is set in

13 the Q-Out_Ready register 750, the queue corresponding to that bit contains information that is

14 ready to be read, while the setting of the same bit in the Q-In_Ready 752 register means that

15 the queue is ready to be written. Similarly, a positive setting of a specific bit in the Q-Empty

16 register 755 means that the queue corresponding to that bit is empty, while a positive setting of

17 a particular bit in the Q-Full register 760 means that the queue corresponding to that bit is full.

18 Thus Q-Out_Ready 745 contains bits zero 746 through thirty-one 748, including bits twenty-

19 seven 752, twenty-eight 754, twenty-nine 756 and thirty 758. Q-In_Ready 750 contains bits

20 zero 762 through thirty-one 764, including bits twenty-seven 766, twenty-eight 768, twenty-

21 nine 770 and thirty 772. Q-Empty 755 contains bits zero 774 through thirty-one 776,

22 including bits twenty-seven 778, twenty-eight 780, twenty-nine 782 and thirty 784, and Q-Full

23 760 contains bits zero 786 through thirty-one 788, including bits twenty-seven 790, twenty- 

24 eight 792, twenty-nine 794 and thirty 796. 
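
The one-bit-per-queue status registers lend themselves to simple mask operations, sketched below. The helper names are hypothetical, and mapping queue number n to bit n with bit zero as the least significant bit is an assumption not stated in the text.

```python
NUM_QUEUES = 32
MASK = (1 << NUM_QUEUES) - 1  # keeps values within a thirty-two bit register


def set_bit(reg, queue):
    """Return reg with the bit for the given queue set."""
    return (reg | (1 << queue)) & MASK


def clear_bit(reg, queue):
    """Return reg with the bit for the given queue cleared."""
    return reg & ~(1 << queue) & MASK


def is_set(reg, queue):
    """True if the bit for the given queue is set in reg."""
    return bool((reg >> queue) & 1)


# Q-twenty-seven has room for a write and is no longer empty.
q_in_ready = set_bit(0, 27)
q_empty = clear_bit(MASK, 27)
```

Because the status of all thirty-two queues fits in four words, a single register read followed by a mask test is enough to learn whether a given queue can be read or written.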

25 Q-zero, corresponding to FIFO 700, is a free buffer queue, which holds a list of addresses 

26 for all available buffers. This queue is addressed when the microprocessor or other devices 

27 need a free buffer address, and so commonly includes appreciable DRAM 460. Thus a device 

28 needing a free buffer address would check with Q-zero to obtain that address. Q-twenty-

29 seven, corresponding to FIFO 702, is a receive buffer descriptor queue. After processing a

30 received frame by the receive sequencer the sequencer looks to store a descriptor for the frame

31 in Q-twenty-seven. If a location for such a descriptor is immediately available in SRAM, bit

32 twenty-seven 766 of Q-In_Ready 750 will be set. If not, the sequencer must wait for the queue

1 manager to initiate a DMA move from SRAM to DRAM, thereby freeing space to store the

2 receive descriptor. 

3 Operation of the queue manager, which manages movement of queue entries between 

4 SRAM and the processor, the transmit and receive sequencers, and also between SRAM and 

5 DRAM, is shown in more detail in FIG. 18. Requests which utilize the queues include

6 Processor Request 802, Transmit Sequencer Request 804, and Receive Sequencer Request 

7 806. Other requests for the queues are DRAM to SRAM Request 808 and SRAM to DRAM 

8 Request 810, which operate on behalf of the queue manager in moving data back and forth 

9 between the DRAM and the SRAM head or tail of the queues. Determining which of these 

10 various requests will get to use the queue manager in the next cycle is handled by priority logic 

11 Arbiter 815. To enable high frequency operation the queue manager is pipelined, with

12 Register A 818 and Register B 820 providing temporary storage, while Status Register 822

13 maintains status until the next update. The queue manager reserves even cycles for DMA,

14 receive and transmit sequencer requests and odd cycles for processor requests. Dual ported 

15 QRAM 825 stores variables regarding each of the queues, the variables for each queue 

16 including a Head Write Pointer, Head Read Pointer, Tail Write Pointer and Tail Read Pointer 

17 corresponding to the queue's SRAM condition, and a Body Write Pointer and Body Read

18 Pointer corresponding to the queue's DRAM condition and the queue's size. 

19 After Arbiter 815 has selected the next operation to be performed, the variables of QRAM

20 825 are fetched and modified according to the selected operation by a QALU 828, and an 

21 SRAM Read Request 830 or an SRAM Write Request 840 may be generated. The variables 

22 are updated and the updated status is stored in Status Register 822 as well as QRAM 825. The 

23 status is also fed to Arbiter 815 to signal that the operation previously requested has been

24 fulfilled, inhibiting duplication of requests. The Status Register 822 updates the four queue

25 registers Q-Out_Ready 745, Q-In_Ready 750, Q-Empty 755 and Q-Full 760 to reflect the new

26 status of the queue that was accessed. Similarly updated are SRAM Addresses 833, Body 

27 Write Request 835 and Body Read Requests 838, which are accessed via DMA to and from 

28 SRAM head and tails for that queue. Alternatively, various processes may wish to write to a

29 queue, as shown by Q Write Data 844, which are selected by multiplexor 846, and pipelined to 

30 SRAM Write Request 840. The SRAM controller services the read and write requests by 

31 writing the tail or reading the head of the accessed queue and returning an acknowledge. In 

32 this manner the various queues are utilized and their status updated.

1 FIGs. 19A-C show a least-recently-used register 900 that is employed for choosing which

2 contexts or CCBs to maintain in INIC cache memory. The INIC in this embodiment can cache 

3 up to sixteen CCBs in SRAM at a given time, and so when a new CCB is cached an old one 

4 must often be discarded, the discarded CCB usually chosen according to this register 900 to be 

5 the CCB that has been used least recently. In this embodiment, a hash table for up to two

6 hundred fifty-six CCBs is also maintained in SRAM, while up to two hundred fifty-six full

7 CCBs are held in DRAM. The least-recently-used register 900 contains sixteen four-bit blocks 

8 labeled R0-R15, each of which corresponds to an SRAM cache unit. Upon initialization, the

9 blocks are numbered 0-15, with number 0 arbitrarily stored in the block representing the least

10 recently used (LRU) cache unit and number 15 stored in the block representing the most

11 recently used (MRU) cache unit. FIG. 19A shows the register 900 at an arbitrary time when

12 the LRU block R0 holds the number 9 and the MRU block R15 holds the number 6.

13 When a different CCB than is currently being held in SRAM is to be cached, the LRU

14 block R0 is read, which in FIG. 19A holds the number 9, and the new CCB is stored in the

15 SRAM cache unit corresponding to number 9. Since the new CCB corresponding to number

16 9 is now the most recently used CCB, the number 9 is stored in the MRU block, as shown in

17 FIG. 19B. The other numbers are all shifted one register block to the left, leaving the number

18 1 in the LRU block. The CCB that had previously been cached in the SRAM unit

19 corresponding to number 9 has been moved to slower but more cost-effective DRAM.

20 FIG. 19C shows the result when the next CCB used had already been cached in SRAM. In

21 this example, the CCB was cached in an SRAM unit corresponding to number 10, and so after

22 employment of that CCB, number 10 is stored in the MRU block. Only those numbers which

23 had previously been more recently used than number 10 (register blocks R9-R15) are shifted

24 to the left, leaving the number 1 in the LRU block. In this manner the INIC maintains the 

25 most active CCBs in SRAM cache. 

26 In some cases a CCB being used is one that is not desirable to hold in the limited cache

27 memory. For example, it is preferable not to cache a CCB for a context that is known to be 

28 closing, so that other cached CCBs can remain in SRAM longer. In this case, the number 

29 representing the cache unit holding the decacheable CCB is stored in the LRU block R0 rather

30 than the MRU block R15, so that the decacheable CCB will be replaced immediately upon

31 employment of a new CCB that is cached in the SRAM unit corresponding to the number held

32 in the LRU block R0. FIG. 19D shows the case for which number 8 (which had been in block

1 R9 in FIG. 19C) corresponds to a CCB that will be used and then closed. In this case number

2 8 has been removed from block R9 and stored in the LRU block R0. All the numbers that had

3 previously been stored to the left of block R9 (R1-R8) are then shifted one block to the right.
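
The three updates of the least-recently-used register described for FIGs. 19B-D can be modeled with an ordered list, as in the sketch below. This is a software analogy only, not the multiplexor circuit of FIG. 20. The class and method names are hypothetical, and the example starts from the initialization state (numbers 0-15 in order) rather than from the state drawn in FIG. 19A.

```python
class LruRegister:
    """Ordered list of cache unit numbers; index 0 is R0 (LRU), index -1 is R15 (MRU)."""

    def __init__(self, size=16):
        self.blocks = list(range(size))

    def allocate(self):
        """Cache miss: reuse the LRU unit, which becomes MRU; all others shift left."""
        unit = self.blocks.pop(0)
        self.blocks.append(unit)
        return unit

    def touch(self, unit):
        """Cache hit: move the unit to MRU; only blocks to its right shift left."""
        self.blocks.remove(unit)
        self.blocks.append(unit)

    def retire(self, unit):
        """Closing context: move the unit to LRU; only blocks to its left shift right."""
        self.blocks.remove(unit)
        self.blocks.insert(0, unit)


lru = LruRegister()
reused = lru.allocate()   # miss: unit 0 is reused and becomes MRU
lru.touch(10)             # hit on the CCB held in unit 10
lru.retire(8)             # unit 8 holds a CCB that is about to close
```

After these three steps unit 8 sits in the LRU position and will be the first to be replaced, while unit 10 sits in the MRU position.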

4 FIG. 20 shows some of the logical units employed to operate the least-recently-used 

5 register 900. An array of sixteen, three or four input multiplexors 910, of which only

6 multiplexors MUX0, MUX7, MUX8, MUX9 and MUX15 are shown for clarity, have outputs

7 fed into the corresponding sixteen blocks of least-recently-used register 900. For example, the

8 output of MUX0 is stored in block R0, the output of MUX7 is stored in block R7, etc. The

9 value of each of the register blocks is connected to an input for its corresponding multiplexor 

10 and also into inputs for both adjacent multiplexors, for use in shifting the block numbers. For 

11 instance, the number stored in R8 is fed into inputs for MUX7, MUX8 and MUX9. MUX0

12 and MUX15 each have only one adjacent block, and the extra input for those multiplexors is

13 used for the selection of LRU and MRU blocks, respectively. MUX15 is shown as a four-

14 input multiplexor, with input 915 providing the number stored on R0.

15 An array of sixteen comparators 920 each receives the value stored in the corresponding 

16 block of the least-recently-used register 900. Each comparator also receives a signal from 

17 processor 470 along line 935 so that the register block having a number matching that sent by 

18 processor 470 outputs true to logic circuits 930 while the other fifteen comparators output

19 false. Logic circuits 930 control a pair of select lines leading to each of the multiplexors, for

20 selecting inputs to the multiplexors and therefore controlling shifting of the register block

21 numbers. Thus select lines 939 control MUX0, select lines 944 control MUX7, select lines

22 949 control MUX8, select lines 954 control MUX9 and select lines 959 control MUX15.

23 When a CCB is to be used, processor 470 checks to see whether the CCB matches a CCB 

24 currently held in one of the sixteen cache units. If a match is found, the processor sends a

25 signal along line 935 with the block number corresponding to that cache unit, for example

26 number 12. Comparators 920 compare the signal from that line 935 with the block numbers

27 and comparator C8 provides a true output for the block R8 that matches the signal, while all

28 the other comparators output false. Logic circuits 930, under control from the processor 470,

29 use select lines 959 to choose the input from line 935 for MUX15, storing the number 12 in the

30 MRU block R15. Logic circuits 930 also send signals along the pairs of select lines for MUX8

31 and higher multiplexors, aside from MUX15, to shift their output one block to the left, by

32 selecting as inputs to each multiplexor MUX8 and higher the value that had been stored in

1 register blocks one block to the right (R9-R15). The outputs of multiplexors that are to the left

2 of MUX8 are selected to be constant. 

3 If processor 470 does not find a match for the CCB among the sixteen cache units, on the 

4 other hand, the processor reads from LRU block R0 along line 966 to identify the cache

5 corresponding to the LRU block, and writes the data stored in that cache to DRAM. The

6 number that was stored in R0, in this case number 3, is chosen by select lines 959 as input 915

7 to MUX15 for storage in MRU block R15. The other fifteen multiplexors output to their

8 respective register blocks the numbers that had been stored in each register block immediately to

9 the right. 

10 For the situation in which the processor wishes to remove a CCB from the cache after use,

11 the LRU block R0 rather than the MRU block R15 is selected for placement of the number

12 corresponding to the cache unit holding that CCB. The number corresponding to the CCB to

13 be placed in the LRU block R0 for removal from SRAM (for example number 1, held in block

14 R9) is sent by processor 470 along line 935, which is matched by comparator C9. The

15 processor instructs logic circuits 930 to input the number 1 to R0, by selecting with lines 939

16 input 935 to MUX0. Select lines 954 to MUX9 choose as input the number held in register

17 block R8, so that the number from R8 is stored in R9. The numbers held by the other register

18 blocks between R0 and R9 are similarly shifted to the right, whereas the numbers in register

19 blocks to the right of R9 are left constant. This frees scarce cache memory from maintaining 

20 closed CCBs for many cycles while their identifying numbers move through register blocks 

21 from the MRU to the LRU blocks. 
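
The three register operations described above (a cache hit, a cache miss, and removal of a closed CCB) can be sketched in software. The following Python model is a minimal illustration only, assuming the sixteen register blocks are represented by a list whose first element stands for LRU block R0 and whose last element stands for MRU block R15. The class and method names (LruRegister, touch, evict, release) are hypothetical and do not appear in the hardware description.

```python
class LruRegister:
    """Model of the register blocks: index 0 is LRU block R0, the last index is MRU block R15."""

    def __init__(self, units=16):
        # Each block holds the number of one cache unit.
        self.blocks = list(range(units))

    def touch(self, unit):
        """Cache hit: place `unit` in the MRU block; blocks to its right shift one place left."""
        self.blocks.remove(unit)
        self.blocks.append(unit)

    def evict(self):
        """Cache miss: return the unit named in the LRU block and reuse it as the MRU unit."""
        unit = self.blocks.pop(0)
        self.blocks.append(unit)
        return unit

    def release(self, unit):
        """Closed CCB: place `unit` in the LRU block; blocks to its left shift one place right."""
        self.blocks.remove(unit)
        self.blocks.insert(0, unit)
```

In the hardware the shifts occur in parallel through the multiplexors, whereas the list operations here are sequential; only the resulting ordering of the cache unit numbers is modeled.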

Figure 21 is another diagram of Intelligent Network Interface Card (INIC) 200 of Figure 13. INIC card 200 includes a Physical Layer Interface (PHY) chip 2100, ASIC chip 400 and Dynamic Random Access Memory (DRAM) 460. PHY chip 2100 couples INIC card 200 to network line 210 via a network connector 2101. INIC card 200 is coupled to the CPU of the host (for example, CPU 28 of host 20 of Figure 1) via card edge connector 2107 and PCI bus 257. ASIC chip 400 includes a Media Access Control (MAC) unit 402, a sequencers block 2102, SRAM control 442, SRAM 440, DRAM control 450, a queue manager 2103, a processor 470, and a PCI bus interface unit 468. Structure and operation of queue manager 2103 is described above in connection with Figure 18 and in U.S. Patent Application Serial Number 09/416,925, entitled "Queue System For Microprocessors", attorney docket no. ALA-005, filed October 13, 1999, by Daryl D. Starr and Clive M. Philbrick (the subject matter of which is incorporated herein by reference). Sequencers block 2102 includes a transmit sequencer 2104, a receive sequencer 2105, and configuration registers 2106. A MAC destination address is stored in configuration register 2106. Part of the program code executed by processor 470 is contained in ROM (not shown) and part is located in a writeable control store SRAM (not shown). The program is downloaded into the writeable control store SRAM at initialization from the host 20.

Figure 22 is a more detailed diagram of receive sequencer 2105. Receive sequencer 2105 includes a data synchronization buffer 2200, a packet synchronization sequencer 2201, a data assembly register 2202, a protocol analyzer 2203, a packet processing sequencer 2204, a queue manager interface 2205, and a Direct Memory Access (DMA) control block 2206. The packet synchronization sequencer 2201 and data synchronization buffer 2200 utilize a network-synchronized clock of MAC 402, whereas the remainder of the receive sequencer 2105 utilizes a fixed-frequency clock. Dashed line 2221 indicates the clock domain boundary.

CD Appendix A contains a complete hardware description (verilog code) of an embodiment of receive sequencer 2105. Signals in the verilog code are named to designate their functions. Individual sections of the verilog code are identified and labeled with comment lines. Each of these sections describes hardware in a block of the receive sequencer 2105 as set forth below in Table 1.

TABLE 1

SECTION OF VERILOG CODE                                         BLOCK OF FIG. 22
Synchronization Interface                                       2201
Sync-Buffer Read-Ptr Synchronizers                              2201
Packet-Synchronization Sequencer                                2201
Data Synchronization Buffer                                     2201 and 2200
Synchronized Status for Link-Destination-Address                2201
Synchronized Status-Vector                                      2201
Synchronization Interface                                       2204
Receive Packet Control and Status                               2204
Buffer-Descriptor                                               2201
Ending Packet Status                                            2201
AssyReg shift-in. Mac -> AssyReg.                               2202 and 2204
Fifo shift-in. AssyReg -> Sram Fifo                             2206
Fifo ShiftOut Burst. SramFifo -> DramBuffer                     2206
Fly-By Protocol Analyzer; Frame, Network and Transport Layers   2203
Link Pointer                                                    2203
Mac address detection                                           2203
Magic pattern detection                                         2203
Link layer and network layer detection                          2203
Network counter                                                 2203
Control Packet analysis                                         2203
Network header analysis                                         2203
Transport layer counter                                         2203
Transport header analysis                                       2203
Pseudo-header stuff                                             2203
Free-Descriptor Fetch                                           2205
Receive-Descriptor Store                                        2205
Receive-Vector Store                                            2205
Queue-manager interface-mux                                     2205
Pause Clock Generator                                           2201
Pause Timer                                                     2204

Operation of receive sequencer 2105 of Figures 21 and 22 is now described in connection with the receipt onto INIC card 200 of a TCP/IP packet from network line 210. At initialization time, processor 470 partitions DRAM 460 into buffers. Receive sequencer 2105 uses the buffers in DRAM 460 to store incoming network packet data as well as status information for the packet. Processor 470 creates a 32-bit buffer descriptor for each buffer. A buffer descriptor indicates the size and location in DRAM of its associated buffer. Processor 470 places these buffer descriptors on a "free-buffer queue" 2108 by writing the descriptors to the queue manager 2103. Queue manager 2103 maintains multiple queues including the "free-buffer queue" 2108. In this implementation, the heads and tails of the various queues are located in SRAM 440, whereas the middle portion of the queues is located in DRAM 460. Lines 2229 comprise a request mechanism involving a request line and address lines. Similarly, lines 2230 comprise a request mechanism involving a request line and address lines. Queue manager 2103 uses lines 2229 and 2230 to issue requests to transfer queue information from DRAM to SRAM or from SRAM to DRAM.

The queue manager interface 2205 of the receive sequencer always attempts to maintain a free buffer descriptor 2207 for use by the packet processing sequencer 2204. Bit 2208 is a ready bit that indicates that free-buffer descriptor 2207 is available for use by the packet processing sequencer 2204. If queue manager interface 2205 does not have a free buffer descriptor (bit 2208 is not set), then queue manager interface 2205 requests one from queue manager 2103 via request line 2209. (Request line 2209 is actually a bus which communicates the request, a queue ID, a read/write signal and data if the operation is a write to the queue.) In response, queue manager 2103 retrieves a free buffer descriptor from the tail of the "free buffer queue" 2108 and then alerts the queue manager interface 2205 via an acknowledge signal on acknowledge line 2210. When queue manager interface 2205 receives the acknowledge signal, the queue manager interface 2205 loads the free buffer descriptor 2207 and sets the ready bit 2208. Because the free buffer descriptor was in the tail of the free buffer queue in SRAM 440, the queue manager interface 2205 actually receives the free buffer descriptor 2207 from the read data bus 2228 of the SRAM control block 442. Packet processing sequencer 2204 requests a free buffer descriptor 2207 via request line 2211. When the queue manager interface 2205 retrieves the free buffer descriptor 2207 and the free buffer descriptor 2207 is available for use by the packet processing sequencer, the queue manager interface 2205 informs the packet processing sequencer 2204 via grant line 2212. By this process, a free buffer descriptor is made available for use by the packet processing sequencer 2204 and the receive sequencer 2105 is ready to process an incoming packet.

Next, a TCP/IP packet is received from the network line 210 via network connector 2101 and Physical Layer Interface (PHY) 2100. PHY 2100 supplies the packet to MAC 402 via a Media Independent Interface (MII) parallel bus 2109. MAC 402 begins processing the packet and asserts a "start of packet" signal on line 2213 indicating that the beginning of a packet is being received. When a byte of data is received in the MAC and is available at the MAC outputs 2215, MAC 402 asserts a "data valid" signal on line 2214. Upon receiving the "data valid" signal, the packet synchronization sequencer 2201 instructs the data synchronization buffer 2200 via load signal line 2222 to load the received byte from data lines 2215. Data synchronization buffer 2200 is four bytes deep. The packet synchronization sequencer 2201 then increments a data synchronization buffer write pointer. This data synchronization buffer write pointer is made available to the packet processing sequencer 2204 via lines 2216. Consecutive bytes of data from data lines 2215 are clocked into the data synchronization buffer 2200 in this way.

A data synchronization buffer read pointer available on lines 2219 is maintained by the packet processing sequencer 2204. The packet processing sequencer 2204 determines that data is available in data synchronization buffer 2200 by comparing the data synchronization buffer write pointer on lines 2216 with the data synchronization buffer read pointer on lines 2219.
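
The pointer comparison described above can be illustrated with a small software model. The following Python sketch is an assumption-laden illustration rather than a description of the verilog code: the class name SyncBuffer and its methods are hypothetical, the pointers are modeled as free-running counters that are reduced modulo the buffer depth when indexing, and no guard against overrun of the four-byte buffer is modeled.

```python
class SyncBuffer:
    """Four-byte-deep buffer filled on one side and drained on the other."""

    DEPTH = 4  # the data synchronization buffer is four bytes deep

    def __init__(self):
        self.slots = [0] * self.DEPTH
        self.write_ptr = 0  # advanced by the packet synchronization sequencer
        self.read_ptr = 0   # advanced by the packet processing sequencer

    def data_available(self):
        # Data is present whenever the two pointers differ.
        return self.write_ptr != self.read_ptr

    def load(self, byte):
        """Store one received byte and advance the write pointer."""
        self.slots[self.write_ptr % self.DEPTH] = byte
        self.write_ptr += 1

    def unload(self):
        """Return the oldest byte and advance the read pointer."""
        if not self.data_available():
            raise IndexError("data synchronization buffer is empty")
        byte = self.slots[self.read_ptr % self.DEPTH]
        self.read_ptr += 1
        return byte
```

The same comparison of a write pointer with a read pointer applies to the data assembly register, where it indicates whether room is available rather than whether data is available.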

Data assembly register 2202 contains a sixteen-byte long shift register 2217. This register 2217 is loaded serially a single byte at a time and is unloaded in parallel. When data is loaded into register 2217, a write pointer is incremented. This write pointer is made available to the packet processing sequencer 2204 via lines 2218. Similarly, when data is unloaded from register 2217, a read pointer maintained by packet processing sequencer 2204 is incremented. This read pointer is available to the data assembly register 2202 via lines 2220. The packet processing sequencer 2204 can therefore determine whether room is available in register 2217 by comparing the write pointer on lines 2218 to the read pointer on lines 2220.

If the packet processing sequencer 2204 determines that room is available in register 2217, then packet processing sequencer 2204 instructs data assembly register 2202 to load a byte of data from data synchronization buffer 2200. The data assembly register 2202 increments the data assembly register write pointer on lines 2218 and the packet processing sequencer 2204 increments the data synchronization buffer read pointer on lines 2219. Data shifted into register 2217 is examined at the register outputs by protocol analyzer 2203, which verifies checksums and generates "status" information 2223.

DMA control block 2206 is responsible for moving information from register 2217 to buffer 2114 via a sixty-four byte receive FIFO 2110. DMA control block 2206 implements receive FIFO 2110 as two thirty-two byte ping-pong buffers using sixty-four bytes of SRAM 440. DMA control block 2206 implements the receive FIFO using a write-pointer and a read-pointer. When data to be transferred is available in register 2217 and space is available in FIFO 2110, DMA control block 2206 asserts an SRAM write request to SRAM controller 442 via lines 2225. SRAM controller 442 in turn moves data from register 2217 to FIFO 2110 and asserts an acknowledge signal back to DMA control block 2206 via lines 2225. DMA control block 2206 then increments the receive FIFO write pointer and causes the data assembly register read pointer to be incremented.

When thirty-two bytes of data have been deposited into receive FIFO 2110, DMA control block 2206 presents a DRAM write request to DRAM controller 450 via lines 2226. This write request consists of the free buffer descriptor 2207 ORed with a "buffer load count" for the DRAM request address, and the receive FIFO read pointer for the SRAM read address. Using the receive FIFO read pointer, the DRAM controller 450 asserts a read request to SRAM controller 442. SRAM controller 442 responds to DRAM controller 450 by returning the indicated data from the receive FIFO 2110 in SRAM 440 and asserting an acknowledge signal. DRAM controller 450 stores the data in a DRAM write data register, stores a DRAM request address in a DRAM address register, and asserts an acknowledge to DMA control block 2206. The DMA control block 2206 then decrements the receive FIFO read pointer. Then the DRAM controller 450 moves the data from the DRAM write data register to buffer 2114. In this way, as consecutive thirty-two byte chunks of data are stored in SRAM 440, DMA control block 2206 moves those thirty-two byte chunks of data one at a time from SRAM 440 to buffer 2114 in DRAM 460. Transferring thirty-two byte chunks of data to the DRAM 460 in this fashion allows data to be written into the DRAM using the relatively efficient burst mode of the DRAM.
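
The addressing of successive thirty-two byte bursts can be sketched as follows. This Python fragment is illustrative only and rests on several assumptions that are not stated in the description: the buffer descriptor is reduced to a base address (buffer_base) aligned so that ORing in the load count is equivalent to adding it, the buffer load count is taken to count bytes, and a final partial burst is not modeled. The function name plan_bursts is hypothetical.

```python
BURST = 32      # bytes per ping-pong half and per DRAM burst
FIFO_SIZE = 64  # the receive FIFO consists of two thirty-two byte halves


def plan_bursts(buffer_base, total_bytes):
    """Return a (dram_address, fifo_offset) pair for each full 32-byte burst."""
    bursts = []
    load_count = 0
    while load_count + BURST <= total_bytes:
        # Assumed: the base is aligned, so OR yields base plus load count.
        dram_address = buffer_base | load_count
        # The two FIFO halves are used alternately: offsets 0, 32, 0, 32, ...
        fifo_offset = load_count % FIFO_SIZE
        bursts.append((dram_address, fifo_offset))
        load_count += BURST
    return bursts
```

For a hypothetical buffer at address 0x20000 receiving ninety-six bytes, the sketch yields three bursts whose DRAM addresses advance by thirty-two bytes while the FIFO offset alternates between the two halves.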

Packet data continues to flow from network line 210 to buffer 2114 until all packet data has been received. MAC 402 then indicates that the incoming packet has completed by asserting an "end of frame" (i.e., end of packet) signal on line 2227 and by presenting final packet status (MAC packet status) to packet synchronization sequencer 2201. The packet processing sequencer 2204 then moves the status 2223 (also called "protocol analyzer status") and the MAC packet status to register 2217 for eventual transfer to buffer 2114. After all the data of the packet has been placed in buffer 2114, status 2223 and the MAC packet status are transferred to buffer 2114 so that they are stored prepended to the associated data as shown in Figure 22.

After all data and status have been transferred to buffer 2114, packet processing sequencer 2204 creates a summary 2224 (also called a "receive packet descriptor") by concatenating the free buffer descriptor 2207, the buffer load-count, the MAC ID, and a status bit (also called an "attention bit"). If the attention bit is a one, then the packet is not a "fast-path candidate"; whereas if the attention bit is a zero, then the packet is a "fast-path candidate". The value of the attention bit represents the result of a significant amount of processing that processor 470 would otherwise have to do to determine whether the packet is a "fast-path candidate". For example, the attention bit being a zero indicates that the packet employs both TCP protocol and IP protocol. By carrying out this significant amount of processing in hardware beforehand and then encoding the result in the attention bit, subsequent decision making by processor 470 as to whether the packet is an actual "fast-path packet" is accelerated. A complete logical description of the attention bit in verilog code is set forth in CD Appendix A in the lines following the heading "Ending Packet Status".

Packet processing sequencer 2204 then sets a ready bit (not shown) associated with summary 2224 and presents summary 2224 to queue manager interface 2205. Queue manager interface 2205 then requests a write to the head of a "summary queue" 2112 (also called the "receive descriptor queue"). The queue manager 2103 receives the request, writes the summary 2224 to the head of the summary queue 2112, and asserts an acknowledge signal back to queue manager interface via line 2210. When queue manager interface 2205 receives the acknowledge, queue manager interface 2205 informs packet processing sequencer 2204 that the summary 2224 is in summary queue 2112 by clearing the ready bit associated with the summary. Packet processing sequencer 2204 also generates additional status information (also called a "vector") for the packet by concatenating the MAC packet status and the MAC ID. Packet processing sequencer 2204 sets a ready bit (not shown) associated with this vector and presents this vector to the queue manager interface 2205. The queue manager interface 2205 and the queue manager 2103 then cooperate to write this vector to the head of a "vector queue" 2113 in similar fashion to the way summary 2224 was written to the head of summary queue 2112 as described above. When the vector for the packet has been written to vector queue 2113, queue manager interface 2205 resets the ready bit associated with the vector.

Once summary 2224 (including a buffer descriptor that points to buffer 2114) has been placed in summary queue 2112 and the packet data has been placed in buffer 2114, processor 470 can retrieve summary 2224 from summary queue 2112 and examine the "attention bit".

If the attention bit from summary 2224 is a digital one, then processor 470 determines that the packet is not a "fast-path candidate" and processor 470 need not examine the packet headers. Only the status 2223 (first sixteen bytes) from buffer 2114 is DMA transferred to SRAM so processor 470 can examine it. If the status 2223 indicates that the packet is a type of packet that is not to be transferred to the host (for example, a multicast frame that the host is not registered to receive), then the packet is discarded (i.e., not passed to the host). If status 2223 does not indicate that the packet is the type of packet that is not to be transferred to the host, then the entire packet (headers and data) is passed to a buffer on host 20 for "slow-path" transport and network layer processing by the protocol stack of host 20.

If, on the other hand, the attention bit is a zero, then processor 470 determines that the packet is a "fast-path candidate". If processor 470 determines that the packet is a "fast-path candidate", then processor 470 uses the buffer descriptor from the summary to DMA transfer the first approximately 96 bytes of information from buffer 2114 from DRAM 460 into a portion of SRAM 440 so processor 470 can examine it. This first approximately 96 bytes contains status 2223 as well as the IP source address of the IP header, the IP destination address of the IP header, the TCP source address of the TCP header, and the TCP destination address of the TCP header. The IP source address of the IP header, the IP destination address of the IP header, the TCP source address of the TCP header, and the TCP destination address of the TCP header together uniquely define a single connection context (TCB) with which the packet is associated. Processor 470 examines these addresses of the TCP and IP headers and determines the connection context of the packet. Processor 470 then checks a list of connection contexts that are under the control of INIC card 200 and determines whether the packet is associated with a connection context (TCB) under the control of INIC card 200.

If the connection context is not in the list, then the "fast-path candidate" packet is determined not to be a "fast-path packet." In such a case, the entire packet (headers and data) is transferred to a buffer in host 20 for "slow-path" processing by the protocol stack of host 20.

If, on the other hand, the connection context is in the list, then software executed by processor 470 including software state machines 2231 and 2232 checks for one of numerous exception conditions and determines whether the packet is a "fast-path packet" or is not a "fast-path packet". These exception conditions include: 1) IP fragmentation is detected; 2) an IP option is detected; 3) an unexpected TCP flag (urgent bit set, reset bit set, SYN bit set or FIN bit set) is detected; 4) the ACK field in the TCP header is before the TCP window, or the ACK field in the TCP header is after the TCP window, or the ACK field in the TCP header shrinks the TCP window; 5) the ACK field in the TCP header is a duplicate ACK and the ACK field exceeds the duplicate ACK count (the duplicate ACK count is a user settable value); and 6) the sequence number of the TCP header is out of order (packet is received out of sequence). If the software executed by processor 470 detects one of these exception conditions, then processor 470 determines that the "fast-path candidate" is not a "fast-path packet." In such a case, the connection context for the packet is "flushed" (the connection context is passed back to the host) so that the connection context is no longer present in the list of connection contexts under control of INIC card 200. The entire packet (headers and data) is transferred to a buffer in host 20 for "slow-path" transport layer and network layer processing by the protocol stack of host 20.

If, on the other hand, processor 470 finds no such exception condition, then the "fast-path candidate" packet is determined to be an actual "fast-path packet". The receive state machine 2232 then processes the packet through TCP. The data portion of the packet in buffer 2114 is then transferred by another DMA controller (not shown in Figure 21) from buffer 2114 to a host-allocated file cache in storage 35 of host 20. In one embodiment, host 20 does no analysis of the TCP and IP headers of a "fast-path packet". All analysis of the TCP and IP headers of a "fast-path packet" is done on INIC card 200.
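
The sequence of decisions described above, from the attention bit through the connection context lookup to the exception check, can be summarized in a short sketch. The following Python function is a simplified illustration and not the microcode of CD Appendix B. The function name classify, the representation of the list of connection contexts as a set of four-element tuples, and the representation of detected exception conditions as a list of labels are all assumptions made for the example.

```python
def classify(attention_bit, status_says_discard, context_key, owned_contexts, exceptions):
    """Return "discard", "slow-path" or "fast-path" for one received packet.

    context_key is (ip_source, ip_destination, tcp_source, tcp_destination).
    owned_contexts is the set of contexts under control of the INIC.
    exceptions lists any exception conditions detected for the packet.
    """
    if attention_bit == 1:
        # Not a fast-path candidate: only the status is examined.
        return "discard" if status_says_discard else "slow-path"
    if context_key not in owned_contexts:
        # Candidate, but its connection context is not held by the INIC.
        return "slow-path"
    if exceptions:
        # Flush: the context is handed back to the host.
        owned_contexts.discard(context_key)
        return "slow-path"
    return "fast-path"
```

Note that an exception condition both sends the packet to the slow path and removes the connection context from the set, so that later packets of the same connection are also handled by the host until the context is handed to the INIC again.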

Figure 23 is a diagram illustrating the transfer of data of "fast-path packets" (packets of a 64k-byte session layer message 2300) from INIC 200 to host 20. The portion of the diagram to the left of the dashed line 2301 represents INIC 200, whereas the portion of the diagram to the right of the dashed line 2301 represents host 20. The 64k-byte session layer message 2300 includes approximately forty-five packets, four of which (2302, 2303, 2304 and 2305) are labeled on Figure 23. The first packet 2302 includes a portion 2306 containing transport and network layer headers (for example, TCP and IP headers), a portion 2307 containing a session layer header, and a portion 2308 containing data. In a first step, portion 2307, the first few bytes of data from portion 2308, and the connection context identifier 2310 of the packet 2300 are transferred from INIC 200 to a 256-byte buffer 2309 in host 20. In a second step, host 20 examines this information and returns to INIC 200 a destination (for example, the location of a file cache 2311 in storage 35) for the data. Host 20 also copies the first few bytes of the data from buffer 2309 to the beginning of a first part 2312 of file cache 2311. In a third step, INIC 200 transfers the remainder of the data from portion 2308 to host 20 such that the remainder of the data is stored in the remainder of first part 2312 of file cache 2311. No network, transport, or session layer headers are stored in first part 2312 of file cache 2311. Next, the data portion 2313 of the second packet 2303 is transferred to host 20 such that the data portion 2313 of the second packet 2303 is stored in a second part 2314 of file cache 2311. The transport layer and network layer header portion 2315 of second packet 2303 is not transferred to host 20. There is no network, transport, or session layer header stored in file cache 2311 between the data portion of first packet 2302 and the data portion of second packet 2303. Similarly, the data portion 2316 of the next packet 2304 of the session layer message is transferred to file cache 2311 so that there are no network, transport, or session layer headers between the data portion of the second packet 2303 and the data portion of the third packet 2304 in file cache 2311. In this way, only the data portions of the packets of the session layer message are placed in the file cache 2311. The data from the session layer message 2300 is present in file cache 2311 as a block such that this block contains no network, transport, or session layer headers.
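
The three-step transfer of Figure 23 can be modeled in a few lines. The following Python sketch is an illustration under stated assumptions: each packet is represented as a dictionary with a "data" entry (the first packet also has a "session_header" entry), the number of "first few bytes" is arbitrarily taken to be sixteen, and the connection context identifier is omitted. The function name deliver_message and the dictionary keys are hypothetical.

```python
SMALL_BUFFER_SIZE = 256  # size of the host buffer that receives the first portion


def deliver_message(packets, first_bytes=16):
    """Return (small_buffer, file_cache) for a multi-packet session layer message.

    Transport and network layer headers are never copied to the host.
    """
    first = packets[0]
    header = first["session_header"]
    # Step 1: session header plus the first few data bytes go to the small buffer.
    small_buffer = header + first["data"][:first_bytes]
    if len(small_buffer) > SMALL_BUFFER_SIZE:
        raise ValueError("session header and first bytes exceed the small buffer")
    # Step 2: the host selects the file cache and copies in those first data bytes.
    file_cache = bytearray(small_buffer[len(header):])
    # Step 3: the remainder of the first packet's data follows directly.
    file_cache += first["data"][first_bytes:]
    # Later packets contribute only their data portions.
    for packet in packets[1:]:
        file_cache += packet["data"]
    return small_buffer, bytes(file_cache)
```

The returned file cache contents are the concatenated data portions only, which corresponds to the header-free block described above.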

In the case of a shorter, single-packet session layer message, portions 2307 and 2308 of the session layer message are transferred to 256-byte buffer 2309 of host 20 along with the connection context identifier 2310 as in the case of the longer session layer message described above. In the case of a single-packet session layer message, however, the transfer is completed at this point. Host 20 does not return a destination to INIC 200 and INIC 200 does not transfer subsequent data to such a destination.

CD Appendix B includes a listing of software executed by processor 470 that determines whether a "fast-path candidate" packet is or is not a "fast-path packet". An example of the instruction set of processor 470 is found starting on page 79 of the Provisional U.S. Patent Application Serial No. 60/061,809, entitled "Intelligent Network Interface Card And System For Protocol Processing", filed October 14, 1997 (the subject matter of this provisional application is incorporated herein by reference).

CD Appendix C includes device driver software executable on host 20 that interfaces the host 20 to INIC card 200. There is also ATCP code that executes on host 20. This ATCP code includes: 1) a "free BSD" stack (available from the University of California, Berkeley) that has been modified slightly to make it run on the NT4 operating system (the "free BSD" stack normally runs on a UNIX machine), and 2) code added to the free BSD stack between the session layer above and the device driver below that enables the BSD stack to carry out "fast-path" processing in conjunction with INIC 200.

TRANSMIT FAST-PATH PROCESSING: The following is an overview of one embodiment of a transmit fast-path flow once a command has been posted (for additional information, see provisional application 60/098,296, filed August 27, 1998). The transmit request may be a segment that is less than the MSS, or it may be as much as a full 64K session layer packet. The former request will go out as one segment, the latter as a number of MSS-sized segments. The transmitting CCB must hold on to the request until all data in it has been transmitted and ACKed. Appropriate pointers to do this are kept in the CCB. To create an output TCP/IP segment, a large DRAM buffer is acquired from the Q_FREEL queue. Then data is DMAd from host memory into the DRAM buffer to create an MSS-sized segment. This DMA also checksums the data. The TCP/IP header is created in SRAM and DMAd to the front of the payload data. It is quicker and simpler to keep a basic frame header (i.e., a template header) permanently in the CCB and DMA this directly from the SRAM CCB buffer into the DRAM buffer each time. Thus the payload checksum is adjusted for the pseudo-header (i.e., the template header) and placed into the TCP header prior to DMAing the header from SRAM. Then the DRAM buffer is queued to the appropriate Q_UXMT transmit queue. The final step is to update various window fields etc. in the CCB. Eventually either the entire request will have been sent and ACKed, or a retransmission timer will expire in which case the context is flushed to the host. In either case, the INIC will place a command response in the response queue containing the command buffer from the original transmit command and appropriate status.
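
The division of a transmit request into segments can be illustrated with a short calculation. The following Python sketch is an example only. The function name segment_sizes is hypothetical, and the MSS value of 1460 bytes used in the discussion below is an assumption (a common value on Ethernet) that is not taken from the description.

```python
def segment_sizes(request_len, mss):
    """Return the payload size of each segment needed to send request_len bytes."""
    if request_len <= 0 or mss <= 0:
        raise ValueError("request length and MSS must be positive")
    full_segments, remainder = divmod(request_len, mss)
    sizes = [mss] * full_segments
    if remainder:
        # A final short segment carries whatever is left over.
        sizes.append(remainder)
    return sizes
```

Under the assumed MSS of 1460 bytes, a 1000-byte request goes out as a single segment, whereas a 65,536-byte request is divided into forty-four full segments followed by one segment of 1296 bytes.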

The above discussion has dealt with how an actual transmit occurs. However, the real challenge in the transmit processor is to determine whether it is appropriate to transmit at the time a transmit request arrives, and then to continue to transmit for as long as the transport protocol permits. There are many reasons not to transmit: the receiver's window size is less than or equal to zero, the persist timer has expired, the amount to send is less than a full segment and an ACK is expected/outstanding, the receiver's window is not half-open, etc. Much of transmit processing will be in determining these conditions.

The fast-path is implemented as a finite state machine (FSM) that covers at least three layers of the protocol stack, i.e., IP, TCP, and Session. The following summarizes the steps involved in normal fast-path transmit command processing: 1) Get control of the associated CCB (gotten from the command): this involves locking the CCB to stop other processing (e.g. Receive) from altering it while this transmit processing is taking place. 2) Get the CCB into an SRAM CCB buffer. There are sixteen of these buffers in SRAM and they are not flushed to DRAM until the buffer space is needed by other CCBs. Acquisition and flushing of these CCB buffers is controlled by a hardware LRU mechanism. Thus getting into a buffer may involve flushing another CCB from its SRAM buffer. 3) Process the send command (EX_SCMD) event against the CCB's FSM.

Each event and state intersection provides an action to be executed and a new state. The following is an example of the state/event transition, the action to be executed and the new state for the SEND command while in transmit state IDLE (SX_IDLE). The action from this state/event intersection is AX_NUCMD and the next state is XMIT COMMAND ACTIVE (SX_XMIT). To summarize, a command to transmit data has been received while transmit is currently idle. The action performs the following steps: 1) Store details of the command into the CCB. 2) Check that it is okay to transmit now (e.g. send window is not zero). 3) If output is not possible, send the Check Output event to Q_EVENT1 queue for the Transmit CCB's FSM and exit. 4) Get a DRAM 2K-byte buffer from the Q_FREEL queue into which to move the payload data. 5) DMA payload data from the addresses in the scatter/gather lists in the command into an offset in the DRAM buffer that leaves space for the frame header. These DMAs will provide the checksum of the payload data. 6) Concurrently with the above DMA, fill out variable details in the frame header template in the CCB. Also get the IP and TCP header checksums while doing this. Note that base IP and TCP header checksums are kept in the CCB, and these are simply updated for fields that vary per frame, viz. IP Id, IP length, IP checksum, TCP sequence and ACK numbers, TCP window size, TCP flags and TCP checksum. 7) When the payload is complete, DMA the frame header from the CCB to the front of the DRAM buffer. 8) Queue the DRAM buffer (i.e., queue a buffer descriptor that points to the DRAM buffer) to the appropriate Q_UXMT queue for the interface for this CCB. 9) Determine if there is more payload in the command. If so, save the current command transfer address details in the CCB and send a CHECK OUTPUT event via the Q_EVENT1 queue to the Transmit CCB. If not, send the ALL COMMAND DATA SENT (EX_ACDS) event to the Transmit CCB. 10) Exit from Transmit FSM processing.
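
The state/event intersection described above lends itself to a table-driven sketch. The following Python fragment is a minimal illustration and not the code of CD Appendix B. Only the single transition named in the text is entered in the table; the dictionary layout and the function name dispatch are assumptions, and the body of the action is supplied by the caller rather than implementing the ten steps listed above.

```python
# (current state, event) -> (action to execute, next state)
TRANSITIONS = {
    ("SX_IDLE", "EX_SCMD"): ("AX_NUCMD", "SX_XMIT"),
}


def dispatch(state, event, actions):
    """Run the action for a state/event intersection and return the new state."""
    action_name, next_state = TRANSITIONS[(state, event)]
    actions[action_name]()  # the caller supplies one callable per action name
    return next_state
```

A send command arriving while the transmit state is SX_IDLE thus looks up the pair (SX_IDLE, EX_SCMD), executes the action registered under AX_NUCMD, and leaves the state machine in SX_XMIT.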

26 Code that implements an embodiment ofthe Transmit FSM (transmit software state 

27 machine 223 1 of Figure 21) is found in CD Appendix B. In one embodiment, fast-path 

28 transmit processing is controlled using write only transmit configuration register (XmtCfg). 

29 Register XmtCfg has the following portions: 1) Bit 3 1 (name: Reset). Writing a one (1) will 

30 force reset asserted to the transmit sequencer ofthe charmel selected by XcvSel. 2) Bit 30 

31 (name: XmtEn). Writing a one (1) allows the transmit sequencer to run. Writing a zero (0) 

32 causes the transmit sequencer to halt after completion of the current packet. 3) Bit 29 (name:
1 PauseEn). Writing a one (1) allows the transmit sequencer to stop packet transmission, after

2 completion of the current packet, whenever the receive sequencer detects an 802.3X pause 

3 command packet. 4) Bit 28 (name: LoadRng). Writing a one (1) causes the data in 

4 RcvAddrB[10:00] to be loaded into the MAC's random number register for use during

5 collision back-offs. 5) Bits 27:20 (name: Reserved). 6) Bits 19:15 (name: FreeQId). Selects 

6 the queue to which the freed buffer descriptors will be written once the packet transmission 

7 has been terminated, either successfully or unsuccessfully. 7) Bits 14:10 (name: XmtQId). 

8 Selects the queue from which the transmit buffer descriptors will be fetched for data packets. 

9 8) Bits 09:05 (name: CtrlQId). Selects the queue from which the transmit buffer descriptors

10 will be fetched for control packets. These packets have transmission priority over the data 

11 packets and will be exhausted before data packets will be transmitted. 9) Bits 04:00 (name:

12 VectQId). Selects the queue to which the transmit vector data is written after the completion 

13 of each packet transmit. In some embodiments, transmit sequencer 2104 of Figure 21 retrieves 

14 buffer descriptors from two transmit queues, one of the queues having a higher transmission 

15 priority than the other. The higher transmission priority transmit queue is used for the

16 transmission of TCP ACKs, whereas the lower transmission priority transmit queue is used for 

17 the transmission of other types of packets. ACKs may be transmitted in accordance with 

18 techniques set forth in U.S. Patent Application Serial No. 09/802,426 (the subject matter of 

19 which is incorporated herein by reference). In some embodiments, the processor that executes 

20 the Transmit FSM, the receive and transmit sequencers, and the host processor that executes 

21 the protocol stack are all realized on the same printed circuit board. The printed circuit board 

22 may, for example, be a card adapted for coupling to another computer. 
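
The XmtCfg bit layout and the two-queue priority rule described above can be sketched in C as follows. This is an illustrative sketch only: the bit positions are taken from the register description, while the macro names, the xmtcfg_fields structure and the functions xmtcfg_pack and select_tx_queue are hypothetical and are not taken from CD Appendix A, B or C.

```c
#include <stdint.h>

/* Single-bit controls of XmtCfg, as listed in the register description. */
#define XMTCFG_RESET    (1u << 31)  /* force reset of the transmit sequencer */
#define XMTCFG_XMTEN    (1u << 30)  /* 1 = run, 0 = halt after current packet */
#define XMTCFG_PAUSEEN  (1u << 29)  /* honor 802.3X pause command packets */
#define XMTCFG_LOADRNG  (1u << 28)  /* load the random number register */

/* 5-bit queue identifier fields; bits 27:20 are reserved and left as zero. */
#define XMTCFG_QID_MASK      0x1Fu
#define XMTCFG_FREEQID_SHIFT 15     /* bits 19:15 */
#define XMTCFG_XMTQID_SHIFT  10     /* bits 14:10 */
#define XMTCFG_CTRLQID_SHIFT 5      /* bits 09:05 */
#define XMTCFG_VECTQID_SHIFT 0      /* bits 04:00 */

/* Hypothetical software view of the register contents. */
struct xmtcfg_fields {
    int     reset, xmt_en, pause_en, load_rng;
    uint8_t free_qid, xmt_qid, ctrl_qid, vect_qid;
};

/* Build the 32-bit value to be written to the write-only register. */
uint32_t xmtcfg_pack(const struct xmtcfg_fields *f)
{
    uint32_t v = 0;
    if (f->reset)    v |= XMTCFG_RESET;
    if (f->xmt_en)   v |= XMTCFG_XMTEN;
    if (f->pause_en) v |= XMTCFG_PAUSEEN;
    if (f->load_rng) v |= XMTCFG_LOADRNG;
    v |= ((uint32_t)f->free_qid & XMTCFG_QID_MASK) << XMTCFG_FREEQID_SHIFT;
    v |= ((uint32_t)f->xmt_qid  & XMTCFG_QID_MASK) << XMTCFG_XMTQID_SHIFT;
    v |= ((uint32_t)f->ctrl_qid & XMTCFG_QID_MASK) << XMTCFG_CTRLQID_SHIFT;
    v |= ((uint32_t)f->vect_qid & XMTCFG_QID_MASK) << XMTCFG_VECTQID_SHIFT;
    return v;
}

/*
 * Priority rule: the control queue is drained before the data queue.
 * Returns the queue id to fetch from, or -1 if both queues are empty.
 */
int select_tx_queue(const struct xmtcfg_fields *f,
                    unsigned ctrl_pending, unsigned data_pending)
{
    if (ctrl_pending > 0)
        return f->ctrl_qid;
    if (data_pending > 0)
        return f->xmt_qid;
    return -1;
}
```

Because the register is described as write only, the sketch keeps the field values in software and rebuilds the full word on every write rather than reading the register back.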

23 All told, the above-described devices and systems for processing of data communication 

24 result in dramatic reductions in the time and host resources required for processing large, 

25 connection-based messages. Protocol processing speed and efficiency are tremendously

26 accelerated by specially designed protocol processing hardware as compared with a general 

27 purpose CPU running conventional protocol software, and interrupts to the host CPU are also 

28 substantially reduced. These advantages can be provided to an existing host by addition of an 

29 intelligent network interface card (INIC), or the protocol processing hardware may be 

30 integrated with the CPU. In either case, the protocol processing hardware and CPU 

31 intelligently decide which device processes a given message, and can change the allocation of 

32 that processing based upon conditions of the message.